Aqueduct is a framework for analyzing large data sets by composing small functional building blocks into complex pipeline graphs that are processed as streams.

Aqueduct provides a simple way to think about your data processing task. You break the task into a sequence of smaller steps, or pipe stages, that each operate on a stream of data. Each stage can do as much or as little as desired: it can add to the stream, modify it, or combine it with other data sources. Aqueduct is written entirely in F#. It can be used from any .NET language, but the functional nature of F# matches well with the functional primitives Aqueduct exposes (Maps, Filters, Joins, etc.).
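
The idea of stages that add to, modify, or combine a stream can be sketched in a few lines. The example below is not Aqueduct's API; it is a minimal illustration using Python generators, and every name in it (map_stage, filter_stage, join_stage, the sample records) is made up for the purpose.

```python
from typing import Callable, Dict, Iterable, Iterator

Record = Dict[str, object]


def map_stage(fn: Callable[[Record], Record], records: Iterable[Record]) -> Iterator[Record]:
    """Modify the stream: apply fn to each record as it passes through."""
    for record in records:
        yield fn(record)


def filter_stage(pred: Callable[[Record], bool], records: Iterable[Record]) -> Iterator[Record]:
    """Narrow the stream: pass along only the records that satisfy pred."""
    for record in records:
        if pred(record):
            yield record


def join_stage(lookup: Dict[object, Record], key: str, records: Iterable[Record]) -> Iterator[Record]:
    """Combine the stream with another data source (an inner join on key)."""
    for record in records:
        extra = lookup.get(record[key])
        if extra is not None:
            yield {**record, **extra}


# Hypothetical sample data: a stream of page visits and a side table of regions.
visits = [
    {"user": "ann", "page": "/home", "ms": 120},
    {"user": "bob", "page": "/buy", "ms": 900},
    {"user": "cat", "page": "/home", "ms": 75},
]
regions = {"ann": {"region": "east"}, "bob": {"region": "west"}}

# Wire the stages together; nothing runs until the result is consumed.
slow = filter_stage(lambda r: r["ms"] >= 100, visits)
tagged = map_stage(lambda r: {**r, "seconds": r["ms"] / 1000}, slow)
joined = join_stage(regions, "user", tagged)

result = list(joined)
for row in result:
    print(row["user"], row["region"], row["seconds"])
```

Each stage sees one record at a time and knows nothing about its neighbors, which is the property the rest of this page relies on.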

A pipe stage is simply a step in a pipeline. Once you can describe your task as a system of pipeline stages, you get a number of benefits:
• Built-in data parallelism. Each stage can operate on a different data point at the same time, as on a conveyor belt-driven assembly line: as soon as a data point moves on to the next stage, the stage it just left can start on the next data point. Also, if the stages are written carefully, computation-intensive operations can be done in parallel and the results merged back together.
• Composability. You can build your computation from small, independently testable pipe segments, and collect them into one larger pipe system. You can take the output of one computation and feed it into another to perform a new computation or to add additional data.
• Scalability. Since Aqueduct operates on individual data points sequentially, the entire data set never needs to be read into memory at once. This means the size of the data set you can process isn't limited by the amount of memory on your machine.
• Declarative programming. Aqueduct lets you specify the computation you want declaratively and makes the data flow graph explicit. This can help you understand exactly what is being done, show where to add extension points, and prevent coding errors (much as functional programming prevents some errors through immutable state and the general avoidance of null values). It also lets you separate what you're calculating from where the data comes from.
In short, once you have described your problem as a graph of pipeline stages, you can separate the question of what you want to compute from the nuts and bolts of how it gets computed - things like threading, distributing, exploiting parallelism, etc.
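
The composability and scalability points above can be sketched as follows. Again this is not Aqueduct code: it is a hedged Python illustration in which compose, square, keep_even, and running_total are hypothetical names. Small stages are collected into a larger pipe, that pipe is reused inside another one, and the whole thing runs over an unbounded source without ever holding the data set in memory.

```python
import itertools
from functools import reduce


def compose(*stages):
    """Chain stages left to right into one larger stage."""
    def pipeline(records):
        # Feed the output stream of each stage into the next one.
        return reduce(lambda stream, stage: stage(stream), stages, records)
    return pipeline


def square(numbers):
    for n in numbers:
        yield n * n


def keep_even(numbers):
    for n in numbers:
        if n % 2 == 0:
            yield n


def running_total(numbers):
    total = 0
    for n in numbers:
        total += n
        yield total


# A small pipe segment that can be tested on its own...
even_squares = compose(square, keep_even)
# ...and then fed into another computation.
totals = compose(even_squares, running_total)

# An unbounded source: it could never be loaded into memory all at once.
source = itertools.count(1)

# Only as many data points as needed are pulled through the pipeline.
first_three = list(itertools.islice(totals(source), 3))
print(first_three)
```

The description of the computation (totals) is independent of where the data comes from; the same pipe could be applied to a file reader or a finite list instead of the counter.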

Aqueduct was developed at Mindset Media, LLC.

(Questions or comments? Please send e-mail to jtigani at gmail.)

Last edited Oct 1, 2009 at 12:05 AM by jtigani, version 6