Benchmarks
Flow-Like’s runtime is built in Rust for predictable, high-throughput workflow execution. Here are the results from our benchmark suite, run with the mimalloc allocator.
Summary (4 Threads - Fair Comparison)
To provide a fair comparison with n8n (which uses 4 vCPUs), we benchmark with 4 worker threads:
| Metric | Value | Description |
|---|---|---|
| Single Execution | ~1.2ms | Time to execute a simple 2-node workflow |
| Peak Throughput (4 threads) | ~124,000 workflows/sec | At 8K concurrent workflows |
| Peak Throughput (16 threads) | ~244,000 workflows/sec | At 65K concurrent workflows |
| Step Latency | ~20-40µs | Per-node execution overhead |
Throughput by Concurrency Level (4 Threads)
These results use 4 worker threads to match typical cloud VM configurations (e.g., n8n’s c5a.large):
| Concurrency | Throughput | Latency |
|---|---|---|
| 128 | ~65,000 exec/s | 2.0ms |
| 512 | ~100,000 exec/s | 5.1ms |
| 1,024 | ~112,000 exec/s | 9.1ms |
| 2,048 | ~121,000 exec/s | 17ms |
| 4,096 | ~123,000 exec/s | 33ms |
| 8,192 | ~124,000 exec/s | 66ms |
Throughput by Concurrency Level (16 Threads)
With full 16-core utilization:
| Concurrency | Throughput | Latency |
|---|---|---|
| 128 | ~60,000 exec/s | 2.1ms |
| 512 | ~140,000 exec/s | 3.6ms |
| 1,024 | ~177,000 exec/s | 5.7ms |
| 4,096 | ~228,000 exec/s | 18ms |
| 8,192 | ~238,000 exec/s | 35ms |
| 32,768 | ~241,000 exec/s | 135ms |
| 65,536 | ~244,000 exec/s | 269ms |
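For context, a sweep like the ones above boils down to keeping a fixed number of workflow executions in flight and counting completions over a fixed window. The sketch below is illustrative only and does not use Flow-Like’s actual execution API; `execute_workflow` is a hypothetical stand-in for running one simple 2-node flow:

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{Duration, Instant};
use tokio::sync::Semaphore;

// Hypothetical stand-in for executing one 2-node workflow end to end.
async fn execute_workflow() {
    // ... build the flow, run it, await the result ...
    tokio::task::yield_now().await;
}

#[tokio::main(worker_threads = 4)]
async fn main() {
    let concurrency = 1024;                 // one level of the sweep
    let measure = Duration::from_secs(10);  // measurement window per level
    let completed = Arc::new(AtomicU64::new(0));
    let semaphore = Arc::new(Semaphore::new(concurrency));

    let start = Instant::now();
    while start.elapsed() < measure {
        // Bound the number of in-flight executions to the target concurrency.
        let permit = semaphore.clone().acquire_owned().await.unwrap();
        let completed = completed.clone();
        tokio::spawn(async move {
            execute_workflow().await;
            completed.fetch_add(1, Ordering::Relaxed);
            drop(permit); // release the concurrency slot
        });
    }

    let total = completed.load(Ordering::Relaxed);
    println!(
        "{} executions in {:?} (~{:.0} exec/s at concurrency {})",
        total,
        start.elapsed(),
        total as f64 / start.elapsed().as_secs_f64(),
        concurrency
    );
}
```

In the actual benchmark suite, the concurrency levels and measurement window come from `FL_CONCURRENCY_LIST` and `FL_MEASURE_SECS` (see the environment variables below).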
Allocator Comparison
Using mimalloc provides significant performance improvements over the system allocator:
| Allocator | Throughput (1,024 concurrent) | Improvement |
|---|---|---|
| mimalloc | ~222,000 exec/s | +24% |
| system | ~179,000 exec/s | baseline |
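Switching a Rust binary to mimalloc is typically a one-line global-allocator change. A minimal sketch, assuming the `mimalloc` crate; the exact wiring behind Flow-Like’s `--features mimalloc` flag may differ:

```rust
// Requires the `mimalloc` crate in Cargo.toml, e.g. mimalloc = "0.1".
use mimalloc::MiMalloc;

// Route every heap allocation through mimalloc instead of the system allocator.
#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;

fn main() {
    // All allocations below now go through mimalloc.
    let payload: Vec<String> = (0..10_000).map(|i| format!("node-{i}")).collect();
    println!("allocated {} strings", payload.len());
}
```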
Running Benchmarks
Run these benchmarks on your own hardware:
```bash
# Test peak throughput with various concurrency levels
FL_CONCURRENCY_LIST="128,512,1024,4096,8192" \
RUST_LOG=off cargo bench --bench throughput_bench --features mimalloc -- peak

# Compare system allocator vs mimalloc
RUST_LOG=off bash packages/catalog/benches/compare_allocators.sh

# Single workflow execution latency
RUST_LOG=off cargo bench --bench allocator_bench --features mimalloc -- single_exec

# Run all benchmarks
RUST_LOG=off cargo bench --features mimalloc
```

Environment Variables
Customize benchmark behavior:
| Variable | Default | Description |
|---|---|---|
| `FL_WORKER_THREADS` | CPU count | Tokio worker threads |
| `FL_CONCURRENCY_LIST` | Auto | Comma-separated concurrency levels to test |
| `FL_MAX_CONCURRENCY` | CPU × 8 | Max concurrency for auto-sweep |
| `FL_MEASURE_SECS` | 10 | Measurement duration per level (seconds) |
| `RUST_LOG` | - | Set to `off` for accurate benchmarks |
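As an illustration of how a setting like `FL_WORKER_THREADS` typically maps onto the Tokio runtime, here is a rough sketch; it is not Flow-Like’s actual benchmark harness:

```rust
use tokio::runtime::Builder;

fn main() {
    // FL_WORKER_THREADS overrides the worker count; otherwise fall back to the CPU count.
    let workers = std::env::var("FL_WORKER_THREADS")
        .ok()
        .and_then(|v| v.parse::<usize>().ok())
        .unwrap_or_else(|| {
            std::thread::available_parallelism()
                .map(|n| n.get())
                .unwrap_or(1)
        });

    let runtime = Builder::new_multi_thread()
        .worker_threads(workers)
        .enable_all()
        .build()
        .expect("failed to build Tokio runtime");

    runtime.block_on(async {
        // ... run the benchmark workload on `workers` threads ...
    });
}
```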
Requirements
- Rust toolchain (stable)
- Test data in the `tests/` directory
- Recommended: 8+ cores for meaningful throughput tests
Benchmark Environment
Results shown were measured on:
- CPU: 16 cores (Apple M-series)
- Memory: 32GB
- OS: macOS
- Rust: Stable toolchain
- Build: Release mode with LTO
- Allocator: mimalloc
What Affects Performance?
- Concurrency Level — Higher concurrency enables better CPU utilization, up to ~65K concurrent workflows
- Allocator Choice — mimalloc provides ~24% improvement over system allocator
- Node Complexity — Simple data routing is fast; heavy compute nodes dominate execution time
- Graph Depth — More sequential nodes = more steps = longer execution
- Data Size — Large payloads increase serialization/deserialization overhead
- Tracing Level — Use `LogLevel::Fatal` for benchmarks; full tracing adds overhead (see the sketch below)
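The tracing and `RUST_LOG=off` guidance above assumes an env-filter style logging setup. A generic sketch of that pattern, using `tracing-subscriber` with its `env-filter` feature rather than Flow-Like’s actual logging initialization:

```rust
use tracing_subscriber::EnvFilter;

fn main() {
    // Honor RUST_LOG; with RUST_LOG=off, every span and event is filtered out,
    // removing tracing overhead from benchmark runs.
    tracing_subscriber::fmt()
        .with_env_filter(EnvFilter::from_default_env())
        .init();

    tracing::info!("suppressed entirely when RUST_LOG=off");
}
```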
Comparison with n8n
Both benchmarks execute a comparable task: a simple 2-node workflow. For a fair comparison, we use 4 threads to match n8n’s c5a.large (4 vCPU) setup (n8n benchmarks):
| Platform | Setup | Throughput | vs n8n |
|---|---|---|---|
| Flow-Like | 4 threads, mimalloc | ~124,000 exec/sec | 564× faster |
| Flow-Like | 16 threads, mimalloc | ~244,000 exec/sec | 1,109× faster |
| n8n (single) | c5a.large (4 vCPU) | ~220 exec/sec | baseline |
| n8n (scaled) | 7× c5a.4xlarge | ~2,000 exec/sec | 9× baseline |
General Comparison
| Platform | Execution Model | Typical Latency |
|---|---|---|
| Flow-Like | Native Rust, typed | ~1-2ms per workflow |
| Node-based tools | JavaScript/Python | ~10-50ms per workflow |
| Cloud workflows | HTTP-based | ~100-500ms per workflow |
Contributing Benchmarks
Found a performance issue or want to add a benchmark?
- Check existing benchmarks in `packages/catalog/benches/`
- Use Criterion for consistent measurement
- Document what you’re measuring and why
- Submit a PR with before/after results