March 25, 2012

A comic look at Haskell

Well, if you're one of the 2 people that read my last set of blog articles, Restringing a Racket with Haskell, you'll know that Alex is attempting to branch out from the ancient esoteric art of Emacs Lisp hacking to learn the well-mannered sport of Racket. I, on the other hand, am trying to steer him towards the abyss. As Oliver Wendell Holmes once said, "The mind, once expanded to the dimensions of larger ideas, never returns to its original size." Here's to Alex with a big head!

Now last time I made a wild and unsubstantiated claim that Haskell makes me a better programmer, and then proceeded to show you a bunch of boring bits about trying to get a silly HTML microdata extractor to work. Well, this time, you'll be happy to know that I'm going to stand by my claim, and likewise go on to bore you with the likes of foreign functions and lazy IO. Why am I wasting your time, you ask? First, you're the one chosing to read this! Second, we've got to learn a little bit of the Haskell landscape before the rockinghorse people can get down to gnashing those marshmallow pies. So if I can get Alex, or anyone else out there in the studio audience over the initial hurdle, I've done my bit.

So this time, I was tossed a comic book archive file (.cbz) and told that our goal was to create a little web app that displays the comic book pages. Along with this, I was given the following bit of Racket foreign function interface to libarchive, a nifty little package developed by another friend of ours, Tim, for reading all sorts of archive formats.

#lang racket/base
(require ffi/unsafe
         ffi/unsafe/define
         ffi/unsafe/alloc
         racket/match)

(define-ffi-definer define-archive (ffi-lib "libarchive" (list "12")))

(define ARCHIVE_OK 0)
(define ARCHIVE_EOF 1)

(define _archive-pointer (_cpointer 'archive))
(define _archive_entry-pointer (_cpointer 'archive-entry))

(define-archive archive_read_close
  (_fun _archive-pointer -> _int) #:wrap (deallocator))
(define-archive archive_read_free
  (_fun _archive-pointer -> _int) #:wrap (deallocator))
(define-archive archive_read_new (_fun -> _archive-pointer)
  #:wrap (allocator archive_read_free))
(define-archive archive_read_support_compression_all
  (_fun _archive-pointer -> _int))
(define-archive archive_read_support_format_all
  (_fun _archive-pointer -> _int))
(define-archive archive_read_open_file
  (_fun _archive-pointer _string _int -> _int)
  #:wrap (allocator archive_read_close))
(define-archive archive_read_next_header
  (_fun _archive-pointer (o : (_ptr o _archive_entry-pointer))
        -> (r : _int)
        -> (values o r)))
(define-archive archive_entry_pathname
  (_fun _archive_entry-pointer -> _string))
(define-archive archive_entry_size
  (_fun _archive_entry-pointer -> _int))
(define-archive archive_entry_filetype
  (_fun _archive_entry-pointer -> _int))
(define-archive archive_read_data
  (_fun _archive-pointer _pointer _int -> _int))

(define (archive-read-current-file archive entry-size)
  (let ([buffer (malloc 'atomic entry-size)])
    (memset buffer 0 entry-size)
    (archive_read_data archive buffer entry-size)
    (make-sized-byte-string buffer entry-size)))

(struct archive-entry (path size filetype) #:transparent)

(define (rewrite-file-mode mode)
  (match mode
    [#o0170000 'fmt]
    [#o0100000 'file]
    [#o0120000 'link]
    [#o0140000 'sock]
    [#o0020000 'chr]
    [#o0060000 'blk]
    [#o0040000 'dir]
    [#o0010000 'fifo]))

(define (archive-read-next-file-entry archive)
  (let-values ([(entry ret) (archive_read_next_header archive)])
    (if (= ret ARCHIVE_OK)
        (archive-entry (archive_entry_pathname entry)
                       (archive_entry_size entry)
                       (rewrite-file-mode (archive_entry_filetype entry)))
        (if (= ret ARCHIVE_EOF)
            #f
            (raise (list "error" ret))))))

(define (archive-folder archive proc init-val)
  (match (archive-read-next-file-entry archive)
    [(archive-entry path size filetype)
     (archive-folder
      archive proc
      (apply proc
             (list (archive-entry path size filetype)
                   (lambda () (archive-read-current-file archive size))
                   init-val)))]
    [#f init-val]))

(define (archive-fold-files archive-path proc init-val)
  (let ([archive (archive_read_new)]) ; check that archive != NULL
    (archive_read_support_compression_all archive) ; check return
    (archive_read_support_format_all archive) ; check return
    (archive_read_open_file archive archive-path 10240) ; check return
    (archive-folder archive proc init-val)))

(archive-fold-files
"/home/alexbl/Downloads/Judge Dredd The Complete Case Files/Judge Dredd The Complete Case Files 001 (Digital Copy Colour Edit).cbr"
 (lambda (entry reader val)
   (if (equal? (archive-entry-filetype entry) 'file)
       (cons (cons (archive-entry-path entry)
                   (let ([data (reader)])
                    (subbytes data 0 (min (bytes-length data) 10)))) val)
       val))
 '())

;; (archive-fold-files
;;  "/home/alexbl/Downloads/Judge Dredd The Complete Case Files/Judge Dredd The Complete Case Files 001 (Digital Copy Colour Edit).cbr"
;;  (lambda (entry reader val)
;;    (cons (cons (archive-entry-path entry) (archive-entry-size entry)) val))
;;  '())

(provide archive-fold-files
         archive-entry)

As you can see from this file, there are a bunch of actual foreign function definitions (like archive_read_close), and then a number of higher-level routines to make it more convenient to get at an archive's contents.

Before we can begin translating this to Haskell foreign functions, there's a lot to learn about how Haskell's FFI works. There are a bunch of articles on that topic on the Haskell Wiki, as well as a very nice series of articles at Inside 233. But like always, I'm just going to cut to the chase and give you the answer, because we're both in a hurry.

To do this, we're going to use Haskell's lowest-level tool, a preprocessor called "hsc2hs" to convert a foreign function definition into working Haskell code. There are several other ways to generate FFI interfaces for Haskell, but hsc2hs isn't too painful, and it's always good to know the basics. Maybe later we'll revisit this with c2hs or something else.

FFI Boilerplate

We start by creating a .hsc file in the right place for Haskell's module hierarchy. If we look on Hackage, we see that many archive handling packages are in the Codec namespace. So we'll do the same and create a file called "Codec/Archive/Libarchive.hsc" and give it some initial boilerplate:

{-# LANGUAGE ForeignFunctionInterface, EmptyDataDecls, DeriveDataTypeable,
             ScopedTypeVariables #-}

#include <archive.h>
#include <archive_entry.h>

module Codec.Archive.Libarchive where

import Data.ByteString (ByteString, replicate, take)
import qualified Data.ByteString.Lazy as L
import Data.ByteString.Unsafe (unsafePackCString, unsafeUseAsCStringLen,
                               unsafePackCStringFinalizer)
import Foreign.C (CInt, CSize, CString, withCAString)
import Foreign.Marshal.Alloc (alloca, free)
import Foreign.Marshal.Array (mallocArray)
import Foreign.Storable (peek)
import Foreign.Ptr (Ptr, FunPtr, castFunPtr, castPtr)
import Foreign.ForeignPtr (ForeignPtr, newForeignPtr, newForeignPtr_,
                           withForeignPtr)
import Prelude hiding (replicate, take)
import System.IO.Unsafe (unsafeInterleaveIO)

I've made this easy for you to see which functions come from which modules, in case you want to go look them up. You really don't need to declare the individual functions you use in this way, provided that there aren't any conflicts between modules. (You'll notice that there is a conflict in the above between ByteString's replicate and take which I'm hiding from the Prelude package. Also the Data.ByteString.Lazy module defines a whole parallel set of lazy ByteString functions that we'll get at by using the "L" prefix.)

Now one cool thing about hsc2hs is that you can just include C headers directly, like I've done with the 2 libarchive headers. We'll see how these get used in just a few minutes. But first a bit of bad news. Unlike normal Haskell .hs files, you can't sit in haskell-mode in Emacs and evaluate .hsc files in the same way. I know that's really disappointing, but maybe some Emacs Lisp hacker will come along and fix that sad situation eventually. In the mean time, we'll resort to Haskell's big fancy make system, Cabal, to build this file. Here's the cabal definition, a file we'll name "libarchive.cabal" that lives in our top-level directory:

name: libarchive
version: 0.1
license: BSD3
cabal-version: >= 1.8
copyright: (c) 2012 Warren Harris
author: Warren Harris <warrensomebody@gmail.com>
maintainer: Warren Harris <warrensomebody@gmail.com>
stability: alpha
synopsis: libarchive bindings
description: libarchive bindings
category: Codec
build-type: Simple

Library
    exposed-modules:    Codec.Archive.Libarchive
    build-depends:      base >= 3 && < 5,
                        bytestring >= 0.9
    extra-libraries:    archive
    ghc-options:        -Wall

First a little aside: If you're on a Mac and you don't yet have libarchive installed, you can get it via macports. I tried this the other day, and decided to port upgrade outdated at the same time. Eleven (yes, 11) hours later, I had all sorts of shiny new toys like gcc45 and Erlang (Erlang?). This was the last straw, so I decided to switch to Homebrew which provides a much more sane package manager for the Mac. Unfortunately, you need a bit of magic to install the development headers for libarchive with Homebrew:

$ brew tap homebrew/dupes
$ brew install `brew --prefix`/Library/Formula/libarchive.rb

Now we're ready to configure and build our (empty) libarchive module::

$ cabal configure
$ cabal install

We'll need to re-run the cabal install command whenever we make a change to our Libarchive.hsc file. At least until we get that Emacs integration.

Our first foreign functions

Now, let's create our first foreign function definitions, starting with the interface to archive_entry. We'll put these at the end of the Libarchive.hsc file:

data ARCHIVE_ENTRY      -- native type
newtype ArchiveEntry = ArchiveEntry { unArchiveEntry :: ForeignPtr ARCHIVE_ENTRY }
                     deriving (Show, Eq)

foreign import ccall unsafe "archive.h archive_entry_pathname"
  archive_entry_pathname :: Ptr ARCHIVE_ENTRY -> IO CString

archiveEntryPathname :: ArchiveEntry -> IO ByteString
archiveEntryPathname ae =
  withForeignPtr (unArchiveEntry ae) $ \a -> do
    cstr <- archive_entry_pathname a
    unsafePackCString cstr

foreign import ccall unsafe "archive.h archive_entry_size"
  archive_entry_size :: Ptr ARCHIVE_ENTRY -> IO CInt

archiveEntrySize :: ArchiveEntry -> IO Int
archiveEntrySize ae =
  withForeignPtr (unArchiveEntry ae) $ \a -> do
    sz <- archive_entry_size a
    return $ fromIntegral sz

foreign import ccall unsafe "archive.h archive_entry_filetype"
  archive_entry_filetype :: Ptr ARCHIVE_ENTRY -> IO CInt

archiveEntryFiletype :: ArchiveEntry -> IO Int
archiveEntryFiletype ae =
  withForeignPtr (unArchiveEntry ae) $ \a -> do
    mode <- archive_entry_filetype a
    return $ fromIntegral mode

The first line defines a new data type, ARCHIVE_ENTRY, with no constructors. The only reason for this is so that we can distinguish native pointers to C archive_entry structures from other kinds of pointers. In Haskell, you'll see Ptr ARCHIVE_ENTRY as the type of the first argument of the archive_entry accessors.

The second line defines a higher-level wrapper type for archive_entry structures. Here we point to our archive_entry with a Haskell ForeignPtr instead of a regular Ptr. ForeignPtrs allow us to associate a finalizer callback with a C pointer, so that when the garbage collector decides the object is no longer needed, the C structure can be automatically freed. You aren't required to design foreign function interfaces with ForeignPtrs. One can always provide APIs that require the user to explicitly free structures, or manage them in a block scope -- but here using finalizers seems like a good way to go. Now, unfortunately in the above routines we don't actually see the finalizer get associated with the instances. That's because all these structs are managed by the archive structure, and its finalizer will automatically destroy the archive_entrys. So we'll come back to this finalizer thing later.

Another thing: newtype is a cool way to define a type that incurs no runtime overhead. Unlike typedef in C that defines type synonyms, newtype defines a type that's logically distinct from the thing it wraps, so you can't accidentally use a ForeignPtr ARCHIVE_ENTRY when an ArchiveEntry is needed. The ArchiveEntry type constructor that wraps these values is only there at compile time, so there's no runtime overhead for using a newtype.

Finally, you'll see that each of these foreign functions definitions has a low-level C function declaration -- the foreign import statements with associated type information, and the higher-level user entry points that deal with unwrapping the ForeignPtrs and converting from C types like CInt to more convenient Haskell types like Int. Let's look at the first foreign import statement, archive_entry_pathname. It uses the unsafe declaration to say that this function is "unsafe" in the sense that it can't call back into the Haskell runtime. If it did, the Haskell runtime would become seriously confused. However, be aware that even if your native function doesn't call back into Haskell, you might want to declare a function safe if you want to allow Haskell's lightweight threads to preempt it. All unsafe functions run to completion, and hog the entire process while they're running.

All of these foreign import declarations also state that the native functions are in the IO monad (the final result such as CString is wrapped with IO). This might seem odd for a function like archive_entry_size which is purely functional in nature. The reason I'm doing this is due to the way libarchive works. Since it streams through the archive file's contents, if you call archive_entry_size after you move on to the next archive header, it returns 0. So it's really critical that when you use it in the IO monad it really runs right away, and isn't run lazily when the value is actually needed. If you allow it to run lazily, it runs too late and comes up empty handed. I found this out the hard way.

This brings us to an interesting point about the IO monad. Like all monads, the IO monad is about sequencing operations together, so that they happen in the right order. However, unlike most monads, the IO monad is strict (i.e. not lazy), which means that the operations happen as soon as you call the functions, rather than when the monad's result value is really needed. We'll come back to this when we talk about lazy IO later.

Moving on to the higher-level functions. First, I've named this by the standard Haskell camel-case naming conventions to distinguish them from the low-level routines. You'll notice that my module declaration doesn't declare which functions are exported (so everything will be exported by default), but the intention is to simply export the high-level functions and types, and hide the lower-level (and possibly dangerous) native methods. Each high-level function like archiveEntryPathname first unwraps the ArchiveEntry type by calling the unArchiveEntry record accessor to get at the ForeignPtr, and then uses withForeignPtr to get at the low-level Ptr ARCHIVE_ENTRY. This is then passed to the native method along with other arguments (translated into their C types), and the results are similarly translated back into Haskell types. For instance, unsafePackCString creates a ByteString from a null-terminated CString, and fromIntegral is the numeric workhorse that converts all sorts of numbers to whatever's needed.

Now one more thing we'll need when we call archiveEntryFiletype is a set of constants for the various modes it returns (things like AE_IFREG in the "archive.h" header). We could have defined these with data constructors, but since I'm lazy I'll just define a set of octal constants here. Note that they have to begin with a lower-case letter since they're values not types:

ae_ifmt :: Int
ae_ifmt = 0o170000

ae_ifreg :: Int
ae_ifreg = 0o100000

ae_iflnk :: Int
ae_iflnk = 0o120000

ae_ifsock :: Int
ae_ifsock = 0o140000

ae_ifchr :: Int
ae_ifchr = 0o020000

ae_ifblk :: Int
ae_ifblk = 0o060000

ae_ifdir :: Int
ae_ifdir = 0o040000

ae_ififo :: Int
ae_ififo = 0o010000

Skipping right along...

Before we define foreign functions for archive operations, let's first define some routines for dealing with these function's return types. Instead of testing an integer result, let's use a data type to define the possible result codes, and some helper functions for calling these functions and converting their int results to this type:

data Result = OK | EOF | RETRY | WARN | FAILED | FATAL
            deriving (Eq, Show)

result :: CInt -> Result
result 0 = OK
result 1 = EOF
result (-10) = RETRY
result (-20) = WARN
result (-25) = FAILED
result (-30) = FATAL
result _ = FATAL  -- unexpected result

withResult :: (Ptr ARCHIVE -> IO CInt) -> Archive -> IO Result
withResult f ar =
  withForeignPtr (unArchive ar) $ \a -> do
    rv <- f a
    return $ result rv

expectOK :: IO Result -> IO ()
expectOK m = m >>= go
  where go OK = return ()
        go err = ioError $ userError (show err)

The function expectOK is something that wraps an IO operation returning a Result into one that throws an exception instead. This is just for the cases where we'd like to streamline the common case, and not deal with error handling every step of the way. I'm not sure that this function makes any sense in a production-quality version of this library, but we'll use it here when it's convenient. Also note that I originally wrote the last line as go err = error (show err) (err is the Result value, and error is a function that aborts the computation by raising an error). However, the exceptions raise by error can't be caught like those by ioError, so I switched.

Finally, here are the rest of the FFI definitions we'll need. They aren't all that interesting after you've groked the above, so I'm not going to say any more about them:

data ARCHIVE     -- native type
newtype Archive = Archive { unArchive :: ForeignPtr ARCHIVE }
                deriving (Show, Eq)

foreign import ccall unsafe "archive.h archive_read_new"
  archive_read_new :: IO (Ptr ARCHIVE)

foreign import ccall unsafe "archive.h &archive_read_finish"
  p_archive_read_finish :: FunPtr (Ptr ARCHIVE -> IO CInt)

archiveReadNew :: IO Archive
archiveReadNew = do
  ar <- archive_read_new
  fp <- newForeignPtr (castFunPtr p_archive_read_finish) ar
  return $ Archive fp

foreign import ccall unsafe "archive.h archive_read_support_compression_all"
  archive_read_support_compression_all :: Ptr ARCHIVE -> IO CInt

archiveReadSupportCompressionAll :: Archive -> IO Result
archiveReadSupportCompressionAll ar =
  withResult archive_read_support_compression_all ar

foreign import ccall unsafe "archive.h archive_read_support_format_all"
  archive_read_support_format_all :: Ptr ARCHIVE -> IO CInt

archiveReadSupportFormatAll :: Archive -> IO Result
archiveReadSupportFormatAll ar =
  withResult archive_read_support_format_all ar

foreign import ccall unsafe "archive.h archive_read_open_file"
  archive_read_open_file :: Ptr ARCHIVE -> CString -> CSize -> IO CInt

archiveReadOpenFile :: Archive -> String -> Int -> IO Result
archiveReadOpenFile ar filename blocksize =
  withCAString filename $ \c_filename -> do
  let sz = fromIntegral blocksize
  withResult (\a -> archive_read_open_file a c_filename sz) ar

foreign import ccall unsafe "archive.h archive_read_next_header"
  archive_read_next_header :: Ptr ARCHIVE -> Ptr (Ptr ARCHIVE_ENTRY) -> IO CInt

archiveReadNextHeader :: Archive -> IO (Either Result ArchiveEntry)
archiveReadNextHeader ar =
  alloca $ \ptr -> do
    rv <- withResult (\a -> archive_read_next_header a ptr) ar
    case rv of
      OK -> do entry <- peek ptr
               fp <- newForeignPtr_ entry
               return $ Right $ ArchiveEntry fp
      err -> return $ Left err

foreign import ccall unsafe "archive.h archive_read_close"
  archive_read_close :: Ptr ARCHIVE -> IO CInt

archiveReadClose :: Archive -> IO Result
archiveReadClose ar = withResult archive_read_close ar

foreign import ccall unsafe "archive.h archive_read_finish"
  archive_read_finish :: Ptr ARCHIVE -> IO CInt

archiveReadFinish :: Archive -> IO Result
archiveReadFinish ar = withResult archive_read_finish ar

foreign import ccall unsafe "archive.h archive_read_data"
  archive_read_data :: Ptr ARCHIVE -> Ptr CString -> CSize -> IO CSize

archiveReadData :: Archive -> ByteString -> IO Int
archiveReadData ar bs =
  withForeignPtr (unArchive ar) $ \a ->
    unsafeUseAsCStringLen bs $ \(buf, len) -> do
    rv <- archive_read_data a (castPtr buf) (fromIntegral len)
    return $ fromIntegral rv

Ok, I will say one thing. In archiveReadNextHeader, you can see that the native archive_entry is wrapped as a ForeignPtr using newForeignPtr_. Unlike newForeignPtr, newForeignPtr_ doesn't take a finalizer. This is because the archive_entry struct will be destroyed when we destroy the archive structure, not when the Haskell ArchiveEntry object goes out of scope. However, in archiveReadNew you can see where we do associate the archive_read_finish native function as a destructor for the Haskell Archive object. (Note how we have to use FunPtr when we want a native pointer to a function. Also, the castPtr is used because we're ignoring the result of archive_read_finish. Haskell expects finalizers to return IO ().)

Next, in part 2 we'll get down to translating higher-level Racket routines like archive-fold-files. Don't doze off yet...

No comments:

Post a Comment