Well, if you're one of the 2 people that read my last set of blog articles, Restringing a Racket with Haskell, you'll know that Alex is attempting to branch out from the ancient esoteric art of Emacs Lisp hacking to learn the well-mannered sport of Racket. I, on the other hand, am trying to steer him towards the abyss. As Oliver Wendell Holmes once said, "The mind, once expanded to the dimensions of larger ideas, never returns to its original size." Here's to Alex with a big head!
Now last time I made a wild and unsubstantiated claim that Haskell makes me a better programmer, and then proceeded to show you a bunch of boring bits about trying to get a silly HTML microdata extractor to work. Well, this time, you'll be happy to know that I'm going to stand by my claim, and likewise go on to bore you with the likes of foreign functions and lazy IO. Why am I wasting your time, you ask? First, you're the one chosing to read this! Second, we've got to learn a little bit of the Haskell landscape before the rockinghorse people can get down to gnashing those marshmallow pies. So if I can get Alex, or anyone else out there in the studio audience over the initial hurdle, I've done my bit.
So this time, I was tossed a comic book archive file (.cbz) and told that our goal was to create a little web app that displays the comic book pages. Along with this, I was given the following bit of Racket foreign function interface to libarchive, a nifty little package developed by another friend of ours, Tim, for reading all sorts of archive formats.
#lang racket/base (require ffi/unsafe ffi/unsafe/define ffi/unsafe/alloc racket/match) (define-ffi-definer define-archive (ffi-lib "libarchive" (list "12"))) (define ARCHIVE_OK 0) (define ARCHIVE_EOF 1) (define _archive-pointer (_cpointer 'archive)) (define _archive_entry-pointer (_cpointer 'archive-entry)) (define-archive archive_read_close (_fun _archive-pointer -> _int) #:wrap (deallocator)) (define-archive archive_read_free (_fun _archive-pointer -> _int) #:wrap (deallocator)) (define-archive archive_read_new (_fun -> _archive-pointer) #:wrap (allocator archive_read_free)) (define-archive archive_read_support_compression_all (_fun _archive-pointer -> _int)) (define-archive archive_read_support_format_all (_fun _archive-pointer -> _int)) (define-archive archive_read_open_file (_fun _archive-pointer _string _int -> _int) #:wrap (allocator archive_read_close)) (define-archive archive_read_next_header (_fun _archive-pointer (o : (_ptr o _archive_entry-pointer)) -> (r : _int) -> (values o r))) (define-archive archive_entry_pathname (_fun _archive_entry-pointer -> _string)) (define-archive archive_entry_size (_fun _archive_entry-pointer -> _int)) (define-archive archive_entry_filetype (_fun _archive_entry-pointer -> _int)) (define-archive archive_read_data (_fun _archive-pointer _pointer _int -> _int)) (define (archive-read-current-file archive entry-size) (let ([buffer (malloc 'atomic entry-size)]) (memset buffer 0 entry-size) (archive_read_data archive buffer entry-size) (make-sized-byte-string buffer entry-size))) (struct archive-entry (path size filetype) #:transparent) (define (rewrite-file-mode mode) (match mode [#o0170000 'fmt] [#o0100000 'file] [#o0120000 'link] [#o0140000 'sock] [#o0020000 'chr] [#o0060000 'blk] [#o0040000 'dir] [#o0010000 'fifo])) (define (archive-read-next-file-entry archive) (let-values ([(entry ret) (archive_read_next_header archive)]) (if (= ret ARCHIVE_OK) (archive-entry (archive_entry_pathname entry) (archive_entry_size entry) (rewrite-file-mode (archive_entry_filetype entry))) (if (= ret ARCHIVE_EOF) #f (raise (list "error" ret)))))) (define (archive-folder archive proc init-val) (match (archive-read-next-file-entry archive) [(archive-entry path size filetype) (archive-folder archive proc (apply proc (list (archive-entry path size filetype) (lambda () (archive-read-current-file archive size)) init-val)))] [#f init-val])) (define (archive-fold-files archive-path proc init-val) (let ([archive (archive_read_new)]) ; check that archive != NULL (archive_read_support_compression_all archive) ; check return (archive_read_support_format_all archive) ; check return (archive_read_open_file archive archive-path 10240) ; check return (archive-folder archive proc init-val))) (archive-fold-files "/home/alexbl/Downloads/Judge Dredd The Complete Case Files/Judge Dredd The Complete Case Files 001 (Digital Copy Colour Edit).cbr" (lambda (entry reader val) (if (equal? (archive-entry-filetype entry) 'file) (cons (cons (archive-entry-path entry) (let ([data (reader)]) (subbytes data 0 (min (bytes-length data) 10)))) val) val)) '()) ;; (archive-fold-files ;; "/home/alexbl/Downloads/Judge Dredd The Complete Case Files/Judge Dredd The Complete Case Files 001 (Digital Copy Colour Edit).cbr" ;; (lambda (entry reader val) ;; (cons (cons (archive-entry-path entry) (archive-entry-size entry)) val)) ;; '()) (provide archive-fold-files archive-entry)
As you can see from this file, there are a bunch of actual foreign function definitions (like archive_read_close
), and then a number of higher-level routines to make it more convenient to get at an archive's contents.
Before we can begin translating this to Haskell foreign functions, there's a lot to learn about how Haskell's FFI works. There are a bunch of articles on that topic on the Haskell Wiki, as well as a very nice series of articles at Inside 233. But like always, I'm just going to cut to the chase and give you the answer, because we're both in a hurry.
To do this, we're going to use Haskell's lowest-level tool, a preprocessor called "hsc2hs" to convert a foreign function definition into working Haskell code. There are several other ways to generate FFI interfaces for Haskell, but hsc2hs isn't too painful, and it's always good to know the basics. Maybe later we'll revisit this with c2hs or something else.
FFI Boilerplate
We start by creating a .hsc file in the right place for Haskell's module hierarchy. If we look on Hackage, we see that many archive handling packages are in the Codec namespace. So we'll do the same and create a file called "Codec/Archive/Libarchive.hsc" and give it some initial boilerplate:
{-# LANGUAGE ForeignFunctionInterface, EmptyDataDecls, DeriveDataTypeable, ScopedTypeVariables #-} #include <archive.h> #include <archive_entry.h> module Codec.Archive.Libarchive where import Data.ByteString (ByteString, replicate, take) import qualified Data.ByteString.Lazy as L import Data.ByteString.Unsafe (unsafePackCString, unsafeUseAsCStringLen, unsafePackCStringFinalizer) import Foreign.C (CInt, CSize, CString, withCAString) import Foreign.Marshal.Alloc (alloca, free) import Foreign.Marshal.Array (mallocArray) import Foreign.Storable (peek) import Foreign.Ptr (Ptr, FunPtr, castFunPtr, castPtr) import Foreign.ForeignPtr (ForeignPtr, newForeignPtr, newForeignPtr_, withForeignPtr) import Prelude hiding (replicate, take) import System.IO.Unsafe (unsafeInterleaveIO)
I've made this easy for you to see which functions come from which modules, in case you want to go look them up. You really don't need to declare the individual functions you use in this way, provided that there aren't any conflicts between modules. (You'll notice that there is a conflict in the above between ByteString's replicate and take which I'm hiding from the Prelude package. Also the Data.ByteString.Lazy module defines a whole parallel set of lazy ByteString functions that we'll get at by using the "L" prefix.)
Now one cool thing about hsc2hs is that you can just include C headers directly, like I've done with the 2 libarchive headers. We'll see how these get used in just a few minutes. But first a bit of bad news. Unlike normal Haskell .hs files, you can't sit in haskell-mode in Emacs and evaluate .hsc files in the same way. I know that's really disappointing, but maybe some Emacs Lisp hacker will come along and fix that sad situation eventually. In the mean time, we'll resort to Haskell's big fancy make system, Cabal, to build this file. Here's the cabal definition, a file we'll name "libarchive.cabal" that lives in our top-level directory:
name: libarchive version: 0.1 license: BSD3 cabal-version: >= 1.8 copyright: (c) 2012 Warren Harris author: Warren Harris <warrensomebody@gmail.com> maintainer: Warren Harris <warrensomebody@gmail.com> stability: alpha synopsis: libarchive bindings description: libarchive bindings category: Codec build-type: Simple Library exposed-modules: Codec.Archive.Libarchive build-depends: base >= 3 && < 5, bytestring >= 0.9 extra-libraries: archive ghc-options: -Wall
First a little aside: If you're on a Mac and you don't yet have libarchive installed, you can get it via macports. I tried this the other day, and decided to port upgrade outdated
at the same time. Eleven (yes, 11) hours later, I had all sorts of shiny new toys like gcc45 and Erlang (Erlang?). This was the last straw, so I decided to switch to Homebrew which provides a much more sane package manager for the Mac. Unfortunately, you need a bit of magic to install the development headers for libarchive with Homebrew:
$ brew tap homebrew/dupes $ brew install `brew --prefix`/Library/Formula/libarchive.rb
Now we're ready to configure and build our (empty) libarchive module::
$ cabal configure $ cabal install
We'll need to re-run the cabal install
command whenever we make a change to our Libarchive.hsc file. At least until we get that Emacs integration.
Our first foreign functions
Now, let's create our first foreign function definitions, starting with the interface to archive_entry.
We'll put these at the end of the Libarchive.hsc file:
data ARCHIVE_ENTRY -- native type newtype ArchiveEntry = ArchiveEntry { unArchiveEntry :: ForeignPtr ARCHIVE_ENTRY } deriving (Show, Eq) foreign import ccall unsafe "archive.h archive_entry_pathname" archive_entry_pathname :: Ptr ARCHIVE_ENTRY -> IO CString archiveEntryPathname :: ArchiveEntry -> IO ByteString archiveEntryPathname ae = withForeignPtr (unArchiveEntry ae) $ \a -> do cstr <- archive_entry_pathname a unsafePackCString cstr foreign import ccall unsafe "archive.h archive_entry_size" archive_entry_size :: Ptr ARCHIVE_ENTRY -> IO CInt archiveEntrySize :: ArchiveEntry -> IO Int archiveEntrySize ae = withForeignPtr (unArchiveEntry ae) $ \a -> do sz <- archive_entry_size a return $ fromIntegral sz foreign import ccall unsafe "archive.h archive_entry_filetype" archive_entry_filetype :: Ptr ARCHIVE_ENTRY -> IO CInt archiveEntryFiletype :: ArchiveEntry -> IO Int archiveEntryFiletype ae = withForeignPtr (unArchiveEntry ae) $ \a -> do mode <- archive_entry_filetype a return $ fromIntegral mode
The first line defines a new data type, ARCHIVE_ENTRY
, with no constructors. The only reason for this is so that we can distinguish native pointers to C archive_entry
structures from other kinds of pointers. In Haskell, you'll see Ptr ARCHIVE_ENTRY
as the type of the first argument of the archive_entry
accessors.
The second line defines a higher-level wrapper type for archive_entry
structures. Here we point to our archive_entry
with a Haskell ForeignPtr
instead of a regular Ptr
. ForeignPtr
s allow us to associate a finalizer callback with a C pointer, so that when the garbage collector decides the object is no longer needed, the C structure can be automatically freed. You aren't required to design foreign function interfaces with ForeignPtr
s. One can always provide APIs that require the user to explicitly free structures, or manage them in a block scope -- but here using finalizers seems like a good way to go. Now, unfortunately in the above routines we don't actually see the finalizer get associated with the instances. That's because all these structs are managed by the archive
structure, and its finalizer will automatically destroy the archive_entry
s. So we'll come back to this finalizer thing later.
Another thing: newtype
is a cool way to define a type that incurs no runtime overhead. Unlike typedef
in C that defines type synonyms, newtype
defines a type that's logically distinct from the thing it wraps, so you can't accidentally use a ForeignPtr ARCHIVE_ENTRY
when an ArchiveEntry
is needed. The ArchiveEntry
type constructor that wraps these values is only there at compile time, so there's no runtime overhead for using a newtype
.
Finally, you'll see that each of these foreign functions definitions has a low-level C function declaration -- the foreign import
statements with associated type information, and the higher-level user entry points that deal with unwrapping the ForeignPtr
s and converting from C types like CInt
to more convenient Haskell types like Int
. Let's look at the first foreign import
statement, archive_entry_pathname
. It uses the unsafe
declaration to say that this function is "unsafe" in the sense that it can't call back into the Haskell runtime. If it did, the Haskell runtime would become seriously confused. However, be aware that even if your native function doesn't call back into Haskell, you might want to declare a function safe if you want to allow Haskell's lightweight threads to preempt it. All unsafe functions run to completion, and hog the entire process while they're running.
All of these foreign import
declarations also state that the native functions are in the IO monad (the final result such as CString
is wrapped with IO
). This might seem odd for a function like archive_entry_size
which is purely functional in nature. The reason I'm doing this is due to the way libarchive works. Since it streams through the archive file's contents, if you call archive_entry_size
after you move on to the next archive header, it returns 0. So it's really critical that when you use it in the IO monad it really runs right away, and isn't run lazily when the value is actually needed. If you allow it to run lazily, it runs too late and comes up empty handed. I found this out the hard way.
This brings us to an interesting point about the IO monad. Like all monads, the IO monad is about sequencing operations together, so that they happen in the right order. However, unlike most monads, the IO monad is strict (i.e. not lazy), which means that the operations happen as soon as you call the functions, rather than when the monad's result value is really needed. We'll come back to this when we talk about lazy IO later.
Moving on to the higher-level functions. First, I've named this by the standard Haskell camel-case naming conventions to distinguish them from the low-level routines. You'll notice that my module declaration doesn't declare which functions are exported (so everything will be exported by default), but the intention is to simply export the high-level functions and types, and hide the lower-level (and possibly dangerous) native methods. Each high-level function like archiveEntryPathname
first unwraps the ArchiveEntry
type by calling the unArchiveEntry
record accessor to get at the ForeignPtr
, and then uses withForeignPtr
to get at the low-level Ptr ARCHIVE_ENTRY
. This is then passed to the native method along with other arguments (translated into their C types), and the results are similarly translated back into Haskell types. For instance, unsafePackCString
creates a ByteString
from a null-terminated CString
, and fromIntegral
is the numeric workhorse that converts all sorts of numbers to whatever's needed.
Now one more thing we'll need when we call archiveEntryFiletype
is a set of constants for the various modes it returns (things like AE_IFREG
in the "archive.h" header). We could have defined these with data constructors, but since I'm lazy I'll just define a set of octal constants here. Note that they have to begin with a lower-case letter since they're values not types:
ae_ifmt :: Int ae_ifmt = 0o170000 ae_ifreg :: Int ae_ifreg = 0o100000 ae_iflnk :: Int ae_iflnk = 0o120000 ae_ifsock :: Int ae_ifsock = 0o140000 ae_ifchr :: Int ae_ifchr = 0o020000 ae_ifblk :: Int ae_ifblk = 0o060000 ae_ifdir :: Int ae_ifdir = 0o040000 ae_ififo :: Int ae_ififo = 0o010000
Skipping right along...
Before we define foreign functions for archive
operations, let's first define some routines for dealing with these function's return types. Instead of testing an integer result, let's use a data type to define the possible result codes, and some helper functions for calling these functions and converting their int
results to this type:
data Result = OK | EOF | RETRY | WARN | FAILED | FATAL deriving (Eq, Show) result :: CInt -> Result result 0 = OK result 1 = EOF result (-10) = RETRY result (-20) = WARN result (-25) = FAILED result (-30) = FATAL result _ = FATAL -- unexpected result withResult :: (Ptr ARCHIVE -> IO CInt) -> Archive -> IO Result withResult f ar = withForeignPtr (unArchive ar) $ \a -> do rv <- f a return $ result rv expectOK :: IO Result -> IO () expectOK m = m >>= go where go OK = return () go err = ioError $ userError (show err)
The function expectOK
is something that wraps an IO operation returning a Result into one that throws an exception instead. This is just for the cases where we'd like to streamline the common case, and not deal with error handling every step of the way. I'm not sure that this function makes any sense in a production-quality version of this library, but we'll use it here when it's convenient. Also note that I originally wrote the last line as go err = error (show err)
(err
is the Result
value, and error
is a function that aborts the computation by raising an error). However, the exceptions raise by error
can't be caught like those by ioError
, so I switched.
Finally, here are the rest of the FFI definitions we'll need. They aren't all that interesting after you've groked the above, so I'm not going to say any more about them:
data ARCHIVE -- native type newtype Archive = Archive { unArchive :: ForeignPtr ARCHIVE } deriving (Show, Eq) foreign import ccall unsafe "archive.h archive_read_new" archive_read_new :: IO (Ptr ARCHIVE) foreign import ccall unsafe "archive.h &archive_read_finish" p_archive_read_finish :: FunPtr (Ptr ARCHIVE -> IO CInt) archiveReadNew :: IO Archive archiveReadNew = do ar <- archive_read_new fp <- newForeignPtr (castFunPtr p_archive_read_finish) ar return $ Archive fp foreign import ccall unsafe "archive.h archive_read_support_compression_all" archive_read_support_compression_all :: Ptr ARCHIVE -> IO CInt archiveReadSupportCompressionAll :: Archive -> IO Result archiveReadSupportCompressionAll ar = withResult archive_read_support_compression_all ar foreign import ccall unsafe "archive.h archive_read_support_format_all" archive_read_support_format_all :: Ptr ARCHIVE -> IO CInt archiveReadSupportFormatAll :: Archive -> IO Result archiveReadSupportFormatAll ar = withResult archive_read_support_format_all ar foreign import ccall unsafe "archive.h archive_read_open_file" archive_read_open_file :: Ptr ARCHIVE -> CString -> CSize -> IO CInt archiveReadOpenFile :: Archive -> String -> Int -> IO Result archiveReadOpenFile ar filename blocksize = withCAString filename $ \c_filename -> do let sz = fromIntegral blocksize withResult (\a -> archive_read_open_file a c_filename sz) ar foreign import ccall unsafe "archive.h archive_read_next_header" archive_read_next_header :: Ptr ARCHIVE -> Ptr (Ptr ARCHIVE_ENTRY) -> IO CInt archiveReadNextHeader :: Archive -> IO (Either Result ArchiveEntry) archiveReadNextHeader ar = alloca $ \ptr -> do rv <- withResult (\a -> archive_read_next_header a ptr) ar case rv of OK -> do entry <- peek ptr fp <- newForeignPtr_ entry return $ Right $ ArchiveEntry fp err -> return $ Left err foreign import ccall unsafe "archive.h archive_read_close" archive_read_close :: Ptr ARCHIVE -> IO CInt archiveReadClose :: Archive -> IO Result archiveReadClose ar = withResult archive_read_close ar foreign import ccall unsafe "archive.h archive_read_finish" archive_read_finish :: Ptr ARCHIVE -> IO CInt archiveReadFinish :: Archive -> IO Result archiveReadFinish ar = withResult archive_read_finish ar foreign import ccall unsafe "archive.h archive_read_data" archive_read_data :: Ptr ARCHIVE -> Ptr CString -> CSize -> IO CSize archiveReadData :: Archive -> ByteString -> IO Int archiveReadData ar bs = withForeignPtr (unArchive ar) $ \a -> unsafeUseAsCStringLen bs $ \(buf, len) -> do rv <- archive_read_data a (castPtr buf) (fromIntegral len) return $ fromIntegral rv
Ok, I will say one thing. In archiveReadNextHeader
, you can see that the native archive_entry
is wrapped as a ForeignPtr
using newForeignPtr_
. Unlike newForeignPtr
, newForeignPtr_
doesn't take a finalizer. This is because the archive_entry
struct will be destroyed when we destroy the archive
structure, not when the Haskell ArchiveEntry
object goes out of scope. However, in archiveReadNew
you can see where we do associate the archive_read_finish
native function as a destructor for the Haskell Archive
object. (Note how we have to use FunPtr
when we want a native pointer to a function. Also, the castPtr
is used because we're ignoring the result of archive_read_finish
. Haskell expects finalizers to return IO ()
.)
Next, in part 2 we'll get down to translating higher-level Racket routines like archive-fold-files
. Don't doze off yet...
No comments:
Post a Comment