Composewell builds Streamly and other open-source Haskell libraries. We also offer consulting and training in Haskell and high-performance functional programming. Learn more at composewell.com.
This is the first in a three-part series on how Streamly achieves high performance in Haskell:
Haskell encourages writing modular, high-level programs using composable building blocks. But modularity often comes at a cost: performance. In imperative languages like C, developers write tight loops that closely follow the CPU execution model. In Haskell, we want the same performance without giving up high-level abstraction and modularity.
Streamly exists to bridge this gap. Its goal is bold but simple: C-like performance in Haskell, achieved by exploiting compiler optimizations — most importantly, stream fusion.
Streamly is built on two simple but powerful abstractions:
Streams – the functional counterpart of loops. In imperative code you’d reach for a for or while loop; in functional code you use a stream. Streams in Streamly extend the ubiquitous Haskell lists with effects and declarative concurrency, making them suitable for both sequential and parallel processing. Think of streams as concurrent, composable loops.
Arrays – efficient storage with mutable and immutable variants, tightly integrated with streams.
Together, these two abstractions elegantly unify several disparate abstractions across the Haskell ecosystem: streams, strict and lazy ByteString, strict and lazy Text, Vector, binary serialization libraries, specialized builders, and more. The goal of Streamly is to build everything from the same well-designed, well-tested, high-performance fundamental abstractions, to avoid repeating that machinery across libraries, and to design for expressiveness, composability, and first-class concurrency.
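As a small illustration of how tightly streams and arrays interoperate, the sketch below collects a stream into an array and streams it back out. It assumes the streamly-core 0.9-style modules (Streamly.Data.Stream, Streamly.Data.Fold, Streamly.Data.Array); in particular, the Array.write and Array.read combinator names may differ in other streamly versions:

import qualified Streamly.Data.Array as Array
import qualified Streamly.Data.Fold as Fold
import qualified Streamly.Data.Stream as Stream

main :: IO ()
main = do
    -- Collect a stream of Ints into a contiguous, unboxed array
    -- (Array.write is a Fold that builds an array).
    arr <- Stream.fold Array.write (Stream.fromList [1 .. 10 :: Int])
    -- Stream the array elements back out and sum them.
    total <- Stream.fold Fold.sum (Array.read arr)
    print total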
Stream pipelines in Haskell are composed with combinators like map,
filter, and fold. Each combinator represents a piece of a loop, and
when combined they form a complete data-processing pipeline.
Programmers build these pipelines declaratively from small, modular building blocks. The compiler then takes care of turning the declarative description into efficient low-level loops.
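As a concrete example, here is a small pipeline written against the streamly-core 0.9-style API (the module names Streamly.Data.Stream and Streamly.Data.Fold are an assumption; adjust for your version). Each combinator contributes one piece of the eventual loop:

import qualified Streamly.Data.Fold as Fold
import qualified Streamly.Data.Stream as Stream

-- Sum the squares of the even numbers in 1..100: conceptually one loop,
-- written as a pipeline of small, reusable stages.
sumEvenSquares :: IO Int
sumEvenSquares =
      Stream.fold Fold.sum       -- fold: consume the stream into a sum
    $ fmap (\x -> x * x)         -- map: square each element
    $ Stream.filter even         -- filter: keep only the even numbers
    $ Stream.fromList [1 .. 100] -- produce: the input stream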
This separation of what you describe (the modular pipeline) from how it runs (tight compiled loops) is what enables a truly functional approach to programming in Haskell and specifically in Streamly: constructing larger systems from smaller, composable parts while staying at a higher level of abstraction.
The challenge: can such modular code ever run as fast as hand-written C?
In practice, GHC already delivers this in many cases. Libraries like vector and foldl have shown that careful design, combined with compiler optimizations, can produce performance rivaling C. Streamly builds on this earlier work, extending fusion to richer, more complex, and more practical use cases.
Modularity has a hidden cost: boxing, that is, indirection. In Haskell, most values are boxed — represented as pointers to heap-allocated data. Passing values between the stages of a modular pipeline often means creating new heap objects, which adds indirection, allocation overhead, and eventual garbage-collection pressure.
Boxed and unboxed values in Haskell:
{-# LANGUAGE MagicHash #-}
import GHC.Exts (Int (I#), Int#)

-- A boxed integer: a pointer to a heap-allocated I# box
x :: Int
x = I# 42#

-- An unboxed integer: a raw machine value. GHC does not allow unboxed
-- bindings at the top level, so the Int# lives in a local binding here.
y :: Int
y = let n = 42# :: Int# in I# n

Int is a boxed type: every Int value is wrapped in the I#
constructor, which must be allocated on the heap. In contrast, Int#
is an unboxed primitive value that lives only in low-level machine
code and does not involve heap allocation. Because the garbage
collector requires persistent data to be boxed, boxing is unavoidable
in many cases — but in tight loops the distinction is critical. More
allocations almost always mean slower programs, and in fact, boxing
is the single largest source of overhead in otherwise efficient
compiler-generated code.
Stream fusion eliminates intermediate allocations by fusing multiple stages of pipelines into a single loop. Instead of passing boxed values between stages, results flow directly from one stage to another in unboxed form.
The outcome: high-level pipelines collapse into monolithic low-level loops, avoiding boxing overhead. These optimized loops are as good as hand-written C code.
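To make this concrete, here is a plain-Haskell sketch (no Streamly involved) of what fusion aims for: a modular pipeline on top and, below it, the kind of single, allocation-free loop a fully fused pipeline is meant to compile down to. The second definition is hand-written purely for comparison:

import Data.List (foldl')

-- The modular version: produce, filter, map, and fold as separate stages.
sumPipeline :: Int
sumPipeline = foldl' (+) 0 (map (* 2) (filter even [1 .. 1000000]))

-- The kind of tight loop fusion aims to produce: one counter, one
-- accumulator, no intermediate lists or boxed cells between stages.
sumLoop :: Int
sumLoop = go 1 0
  where
    go :: Int -> Int -> Int
    go i acc
        | i > 1000000 = acc
        | even i      = go (i + 1) (acc + i * 2)
        | otherwise   = go (i + 1) acc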
Not all code is equally amenable to fusion. For the compiler to optimize effectively, data types and combinators must be designed with fusion in mind.
Streamly’s core types are designed specifically for this purpose. The result is that programs written with modular combinators can compile into loops that are not just comparable to hand-written C, but sometimes even better — since the compiler has no concern for readability or maintainability, it can produce loops far too complex for humans to write manually.
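The key design idea is to represent a stream as a non-recursive step function over a hidden state, the representation popularized by the stream-fusion work behind vector; streamly’s internal direct-style stream has essentially this shape, though the sketch below uses illustrative names rather than streamly’s actual internals:

{-# LANGUAGE ExistentialQuantification #-}

-- One step of the loop: produce an element, skip, or stop.
data Step s a = Yield a s | Skip s | Stop

-- A stream is a step function plus an initial state; the state type is
-- hidden, so each combinator can extend it without changing the interface.
data Stream m a = forall s. Stream (s -> m (Step s a)) s

-- Combinators are non-recursive state transformers. Because nothing here
-- recurses, GHC can inline a whole pipeline of such combinators into a
-- single loop over a single composite state, with no boxed values passed
-- between stages.
mapS :: Monad m => (a -> b) -> Stream m a -> Stream m b
mapS f (Stream step st) = Stream step' st
  where
    step' s = do
        r <- step s
        return $ case r of
            Yield x s' -> Yield (f x) s'
            Skip s'    -> Skip s'
            Stop       -> Stop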
When writing Haskell, it’s tempting to think like a C programmer — tweaking code to “optimize” it manually. In Haskell, such manual tweaks often make no difference.
Looking at GHC’s intermediate language, Core, you’ll often find that different source-level implementations compile to exactly the same Core. That’s because GHC aggressively simplifies programs.
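You can check this yourself by dumping the simplified Core. With -O2, the two definitions below, one using a library fold and one written as an explicit loop, typically end up as essentially the same worker loop (the exact Core depends on the GHC version):

{-# LANGUAGE BangPatterns #-}

-- Inspect the Core with:
--   ghc -O2 -ddump-simpl -dsuppress-all -dsuppress-uniques Sum.hs
module Sum (sumA, sumB) where

import Data.List (foldl')

-- A high-level definition using a library fold.
sumA :: [Int] -> Int
sumA = foldl' (+) 0

-- A hand-written, explicitly strict loop.
sumB :: [Int] -> Int
sumB = go 0
  where
    go !acc []       = acc
    go !acc (x : xs) = go (acc + x) xs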
Haskell has purity — something most mainstream languages do not have. Referential transparency gives GHC extraordinary freedom to transform programs: it can aggressively rewrite and simplify code in ways that would be unsafe or even impossible in impure languages.
As a result, high-level functional programs can be collapsed into low-level imperative loops with no performance loss.
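Rewrite rules are one concrete form of this freedom. The user-defined rule below merges two list traversals into one; it is sound only because the functions involved are pure. GHC’s own libraries ship rules of this general shape; this standalone module is just an illustration, and whether this particular rule fires depends on GHC’s built-in list-fusion rules:

module MapMap (incrDouble) where

-- Sound only for pure f and g: do the work in one pass instead of two.
{-# RULES
"map/map"  forall f g xs.  map f (map g xs) = map (f . g) xs
  #-}

incrDouble :: [Int] -> [Int]
incrDouble xs = map (+ 1) (map (* 2) xs)
-- At -O2, GHC may rewrite the two maps above into a single traversal,
-- either via this rule or via its own build/foldr fusion rules.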
This advantage, however, is not always fully exploited. Streamly’s mission is to make it practical and reliable for general-purpose programming.
Streamly and other Haskell libraries already show that functional programming and high performance can coexist. A few examples:
Unicode normalization – The unicode-data and
unicode-transforms libraries achieve performance that matches — and
in some cases exceeds — the ICU C++ library.
Directory traversal – A Streamly implementation rivals, and sometimes surpasses, the fastest Rust implementation, while being shorter and highly modular.
Line and word count – A modular word-count program in Streamly matches C performance, supports UTF-8 decoding, and can be parallelized with ease; even the UTF-8 decoding step itself can be parallelized.
These examples are proof points: Haskell’s high-level abstractions, when paired with fusion and careful design, can go head-to-head with low-level systems languages like C and Rust.
Stream fusion in GHC is powerful, but not always predictable. Sometimes optimizations fail to fire, leaving allocations in the final code.
In the next posts, we’ll look at how Streamly addresses this: first by exploring how fusion works in detail, and then by introducing a GHC fusion-plugin that ensures fusion works reliably. Up next: