Composewell builds Streamly and other open-source Haskell libraries. We also offer consulting and training in Haskell and high-performance functional programming. Learn more at composewell.com.
This is the first in a three-part series on how Streamly achieves high performance in Haskell:
Haskell encourages writing modular, high-level programs using composable building blocks. But modularity often comes at a cost: performance. In imperative languages like C, developers write tight loops that closely follow the CPU execution model. In Haskell, we want the same performance without giving up high-level abstraction and modularity.
Streamly exists to bridge this gap. Its goal is bold but simple: C-like performance in Haskell, achieved by exploiting compiler optimizations — most importantly, stream fusion.
Streamly is built on two simple but powerful abstractions:
Streams – the functional counterpart of loops. In imperative code you’d reach for a for or while loop; in functional code you use a stream. Streams in Streamly extend the ubiquitous Haskell lists with effects and declarative concurrency, making them suitable for both sequential and parallel processing. Think of streams as concurrent, composable loops.
Arrays – efficient storage with mutable and immutable variants, tightly integrated with streams.
Together, these two abstractions elegantly unify several disparate abstractions across the Haskell ecosystem: streams, strict and lazy ByteString, strict and lazy Text, Vector, binary serialization libraries, specialized builders, and more. The goal of Streamly is to build everything from the same well-designed, well-tested, high-performance fundamental abstractions, to avoid repeating that machinery across libraries, and to design for expressiveness, composability, and first-class concurrency.
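As a small illustration of how tightly streams and arrays interoperate, the sketch below collects a stream into an array and streams it back out. It assumes the streamly-core 0.9-style modules (Streamly.Data.Stream, Streamly.Data.Fold, Streamly.Data.Array); in particular, the Array.write and Array.read combinator names may differ in other streamly versions:

import qualified Streamly.Data.Array as Array
import qualified Streamly.Data.Fold as Fold
import qualified Streamly.Data.Stream as Stream

main :: IO ()
main = do
    -- Collect a stream of Ints into a contiguous, unboxed array
    -- (Array.write is a Fold that builds an array).
    arr <- Stream.fold Array.write (Stream.fromList [1 .. 10 :: Int])
    -- Stream the array elements back out and sum them.
    total <- Stream.fold Fold.sum (Array.read arr)
    print total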
Stream pipelines in Haskell are composed with combinators like map,
filter, and fold. Each combinator represents a piece of a loop, and
when combined they form a complete data-processing pipeline.
Programmers build these pipelines declaratively from small, modular building blocks. The compiler then takes care of turning the declarative description into efficient low-level loops.
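As a concrete example, here is a small pipeline written against the streamly-core 0.9-style API (the module names Streamly.Data.Stream and Streamly.Data.Fold are an assumption; adjust for your version). Each combinator contributes one piece of the eventual loop:

import qualified Streamly.Data.Fold as Fold
import qualified Streamly.Data.Stream as Stream

-- Sum the squares of the even numbers in 1..100: conceptually one loop,
-- written as a pipeline of small, reusable stages.
sumEvenSquares :: IO Int
sumEvenSquares =
      Stream.fold Fold.sum       -- fold: consume the stream into a sum
    $ fmap (\x -> x * x)         -- map: square each element
    $ Stream.filter even         -- filter: keep only the even numbers
    $ Stream.fromList [1 .. 100] -- produce: the input stream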
This separation of what you describe (the modular pipeline) from how it runs (tight compiled loops) is what enables a truly functional approach to programming in Haskell and specifically in Streamly: constructing larger systems from smaller, composable parts while staying at a higher level of abstraction.
The challenge: can such modular code ever run as fast as hand-written C?
In practice, GHC already delivers this in many cases. Libraries like vector and foldl have shown that careful design, combined with compiler optimizations, can produce performance rivaling C. Streamly builds on this earlier work, extending fusion to richer, more complex, and more practical use cases.
Modularity has a hidden cost: boxing, that is, indirection. In Haskell, most values are boxed — represented as pointers to heap-allocated data. Passing values between the stages of a modular pipeline often means creating new heap objects, which adds indirection, allocation overhead, and eventual garbage-collection pressure.
Boxed and unboxed values in Haskell:
{-# LANGUAGE MagicHash #-}
import GHC.Exts (Int (I#), Int#)

-- A boxed integer: a pointer to a heap-allocated I# box
x :: Int
x = I# 42#

-- An unboxed integer: a raw machine value. GHC does not allow unboxed
-- bindings at the top level, so the Int# lives in a local binding here.
y :: Int
y = let n = 42# :: Int# in I# n

Int is a boxed type: every Int value is wrapped in the I#
constructor, which must be allocated on the heap. In contrast, Int#
is an unboxed primitive value that lives only in low-level machine
code and does not involve heap allocation. Because the garbage
collector requires persistent data to be boxed, boxing is unavoidable
in many cases — but in tight loops the distinction is critical. More
allocations almost always mean slower programs, and in fact, boxing
is the single largest source of overhead in otherwise efficient
compiler-generated code.
Stream fusion eliminates intermediate allocations by fusing multiple stages of pipelines into a single loop. Instead of passing boxed values between stages, results flow directly from one stage to another in unboxed form.
The outcome: high-level pipelines collapse into monolithic low-level loops, avoiding boxing overhead. These optimized loops are as good as hand-written C code.
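To make this concrete, here is a plain-Haskell sketch (no Streamly involved) of what fusion aims for: a modular pipeline on top and, below it, the kind of single, allocation-free loop a fully fused pipeline is meant to compile down to. The second definition is hand-written purely for comparison:

import Data.List (foldl')

-- The modular version: produce, filter, map, and fold as separate stages.
sumPipeline :: Int
sumPipeline = foldl' (+) 0 (map (* 2) (filter even [1 .. 1000000]))

-- The kind of tight loop fusion aims to produce: one counter, one
-- accumulator, no intermediate lists or boxed cells between stages.
sumLoop :: Int
sumLoop = go 1 0
  where
    go :: Int -> Int -> Int
    go i acc
        | i > 1000000 = acc
        | even i      = go (i + 1) (acc + i * 2)
        | otherwise   = go (i + 1) acc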
Not all code is equally amenable to fusion. For the compiler to optimize effectively, data types and combinators must be designed with fusion in mind.
Streamly’s core types are designed specifically for this purpose. The result is that programs written with modular combinators can compile into loops that are not just comparable to hand-written C, but sometimes even better — since the compiler has no concern for readability or maintainability, it can produce loops far too complex for humans to write manually.
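The key design idea is to represent a stream as a non-recursive step function over a hidden state, the representation popularized by the stream-fusion work behind vector; streamly’s internal direct-style stream has essentially this shape, though the sketch below uses illustrative names rather than streamly’s actual internals:

{-# LANGUAGE ExistentialQuantification #-}

-- One step of the loop: produce an element, skip, or stop.
data Step s a = Yield a s | Skip s | Stop

-- A stream is a step function plus an initial state; the state type is
-- hidden, so each combinator can extend it without changing the interface.
data Stream m a = forall s. Stream (s -> m (Step s a)) s

-- Combinators are non-recursive state transformers. Because nothing here
-- recurses, GHC can inline a whole pipeline of such combinators into a
-- single loop over a single composite state, with no boxed values passed
-- between stages.
mapS :: Monad m => (a -> b) -> Stream m a -> Stream m b
mapS f (Stream step st) = Stream step' st
  where
    step' s = do
        r <- step s
        return $ case r of
            Yield x s' -> Yield (f x) s'
            Skip s'    -> Skip s'
            Stop       -> Stop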
When writing Haskell, it’s tempting to think like a C programmer — tweaking code to “optimize” it manually. In Haskell, such manual tweaks often make no difference.
Looking at GHC’s intermediate language, Core, you’ll often find that different source-level implementations compile to exactly the same Core. That’s because GHC aggressively simplifies programs.
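You can check this yourself by dumping the simplified Core. With -O2, the two definitions below, one using a library fold and one written as an explicit loop, typically end up as essentially the same worker loop (the exact Core depends on the GHC version):

{-# LANGUAGE BangPatterns #-}

-- Inspect the Core with:
--   ghc -O2 -ddump-simpl -dsuppress-all -dsuppress-uniques Sum.hs
module Sum (sumA, sumB) where

import Data.List (foldl')

-- A high-level definition using a library fold.
sumA :: [Int] -> Int
sumA = foldl' (+) 0

-- A hand-written, explicitly strict loop.
sumB :: [Int] -> Int
sumB = go 0
  where
    go !acc []       = acc
    go !acc (x : xs) = go (acc + x) xs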
Haskell has purity — something most mainstream languages do not have. Referential transparency gives GHC extraordinary freedom to transform programs: it can aggressively rewrite and simplify code in ways that would be unsafe or even impossible in impure languages.
As a result, high-level functional programs can be collapsed into low-level imperative loops with no performance loss.
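Rewrite rules are one concrete form of this freedom. The user-defined rule below merges two list traversals into one; it is sound only because the functions involved are pure. GHC’s own libraries ship rules of this general shape; this standalone module is just an illustration, and whether this particular rule fires depends on GHC’s built-in list-fusion rules:

module MapMap (incrDouble) where

-- Sound only for pure f and g: do the work in one pass instead of two.
{-# RULES
"map/map"  forall f g xs.  map f (map g xs) = map (f . g) xs
  #-}

incrDouble :: [Int] -> [Int]
incrDouble xs = map (+ 1) (map (* 2) xs)
-- At -O2, GHC may rewrite the two maps above into a single traversal,
-- either via this rule or via its own build/foldr fusion rules.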
This advantage, however, is not always fully exploited. Streamly’s mission is to make it practical and reliable for general-purpose programming.
Streamly and other Haskell libraries already show that functional programming and high performance can coexist. A few examples:
Unicode normalization – The unicode-data and
unicode-transforms libraries achieve performance that matches — and
in some cases exceeds — the ICU C++ library.
Directory traversal – A Streamly implementation rivals, and sometimes surpasses, the fastest Rust implementation, while being shorter and highly modular.
Line and word count – A modular word-count program in Streamly matches C performance, supports UTF-8 decoding, and can be parallelized with ease; even the UTF-8 decoding step itself can be parallelized.
These examples are proof points: Haskell’s high-level abstractions, when paired with fusion and careful design, can go head-to-head with low-level systems languages like C and Rust.
Stream fusion in GHC is powerful, but not always predictable. Sometimes optimizations fail to fire, leaving allocations in the final code.
In the next posts, we’ll look at how Streamly addresses this: first by exploring how fusion works in detail, and then by introducing a GHC fusion-plugin that ensures fusion works reliably. Up next: