
Benchmarks

Flow-Like’s runtime is built in Rust for predictable, high-throughput workflow execution. The results below come from our benchmark suite, run with the mimalloc allocator.

To provide a fair comparison with n8n (which uses 4 vCPUs), we benchmark with 4 worker threads:

| Metric | Value | Description |
| --- | --- | --- |
| Single Execution | ~1.2ms | Time to execute a simple 2-node workflow |
| Peak Throughput (4 threads) | ~124,000 workflows/sec | At 8K concurrent workflows |
| Peak Throughput (16 threads) | ~244,000 workflows/sec | At 65K concurrent workflows |
| Step Latency | ~20-40µs | Per-node execution overhead |

Throughput by Concurrency Level (4 Threads)

These results use 4 worker threads to match typical cloud VM configurations (e.g., n8n’s c5a.large):

| Concurrency | Throughput | Latency |
| --- | --- | --- |
| 128 | ~65,000 exec/s | 2.0ms |
| 512 | ~100,000 exec/s | 5.1ms |
| 1,024 | ~112,000 exec/s | 9.1ms |
| 2,048 | ~121,000 exec/s | 17ms |
| 4,096 | ~123,000 exec/s | 33ms |
| 8,192 | ~124,000 exec/s | 66ms |

Throughput by Concurrency Level (16 Threads)

With full 16-core utilization:

| Concurrency | Throughput | Latency |
| --- | --- | --- |
| 128 | ~60,000 exec/s | 2.1ms |
| 512 | ~140,000 exec/s | 3.6ms |
| 1,024 | ~177,000 exec/s | 5.7ms |
| 4,096 | ~228,000 exec/s | 18ms |
| 8,192 | ~238,000 exec/s | 35ms |
| 32,768 | ~241,000 exec/s | 135ms |
| 65,536 | ~244,000 exec/s | 269ms |
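In a closed-loop benchmark like this one, latency and throughput are tied together by Little's law: latency ≈ concurrency / throughput. A quick sketch that recomputes the 16-thread table from that relation:

```rust
// Sanity check: Little's law predicts latency ≈ concurrency / throughput
// for a closed-loop benchmark. Pairs below are (concurrency, exec/s)
// taken from the 16-thread table above.
fn main() {
    let rows: [(u32, f64); 7] = [
        (128, 60_000.0),
        (512, 140_000.0),
        (1_024, 177_000.0),
        (4_096, 228_000.0),
        (8_192, 238_000.0),
        (32_768, 241_000.0),
        (65_536, 244_000.0),
    ];
    for (concurrency, throughput) in rows {
        // Convert seconds to milliseconds for comparison with the table.
        let latency_ms = concurrency as f64 / throughput * 1_000.0;
        println!("{concurrency:>6} concurrent -> ~{latency_ms:.1} ms");
    }
}
```

Each computed value lands within rounding distance of the measured latency column, which is why latency grows roughly linearly once throughput plateaus.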

Using mimalloc provides significant performance improvements over the system allocator:

| Allocator | Throughput (1K conc.) | Improvement |
| --- | --- | --- |
| mimalloc | ~222,000 exec/s | +24% |
| system | ~179,000 exec/s | baseline |
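Swapping Rust's global allocator for mimalloc is a small wiring change. As a point of reference, the standard setup for the mimalloc crate looks like the sketch below (the dependency version is illustrative):

```rust
// Cargo.toml (version illustrative):
//   [dependencies]
//   mimalloc = "0.1"

use mimalloc::MiMalloc;

// Route every heap allocation in this binary through mimalloc.
#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;

fn main() {
    // All allocations below now go through mimalloc.
    let buf: Vec<u64> = (0..1_000).collect();
    println!("allocated {} elements", buf.len());
}
```

In Flow-Like's benchmarks this is gated behind a cargo feature, which is why the bench command below passes `--features mimalloc`.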

Run these benchmarks on your own hardware:

```sh
# Test peak throughput with various concurrency levels
FL_CONCURRENCY_LIST="128,512,1024,4096,8192" \
RUST_LOG=off cargo bench --bench throughput_bench --features mimalloc -- peak
```

Customize benchmark behavior:

| Variable | Default | Description |
| --- | --- | --- |
| `FL_WORKER_THREADS` | CPU count | Tokio worker threads |
| `FL_CONCURRENCY_LIST` | Auto | Comma-separated concurrency levels to test |
| `FL_MAX_CONCURRENCY` | CPU × 8 | Max concurrency for auto-sweep |
| `FL_MEASURE_SECS` | 10 | Measurement duration (seconds) per level |
| `RUST_LOG` | - | Set to `off` for accurate benchmarks |
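As a hedged illustration of how env-driven configuration of this shape can be parsed (this is not Flow-Like's actual harness code; only the variable names come from the table above):

```rust
use std::env;

// Read an environment variable, falling back to a default when it is
// unset or fails to parse.
fn env_or<T: std::str::FromStr>(name: &str, default: T) -> T {
    env::var(name)
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(default)
}

fn main() {
    // Default worker threads to the machine's CPU count.
    let threads: usize = env_or(
        "FL_WORKER_THREADS",
        std::thread::available_parallelism().map(|n| n.get()).unwrap_or(1),
    );
    let measure_secs: u64 = env_or("FL_MEASURE_SECS", 10);
    // FL_CONCURRENCY_LIST is comma-separated, e.g. "128,512,1024".
    let levels: Vec<usize> = env::var("FL_CONCURRENCY_LIST")
        .map(|s| s.split(',').filter_map(|x| x.trim().parse().ok()).collect())
        .unwrap_or_else(|_| vec![128, 512, 1024]);
    println!("threads={threads} measure_secs={measure_secs} levels={levels:?}");
}
```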

To run the benchmarks yourself you will need:

  • Rust toolchain (stable)
  • Test data in the tests/ directory
  • Recommended: 8+ cores for meaningful throughput tests

Results shown were measured on:

  • CPU: 16 cores (Apple M-series)
  • Memory: 32GB
  • OS: macOS
  • Rust: Stable toolchain
  • Build: Release mode with LTO
  • Allocator: mimalloc

Several factors determine real-world throughput:

  1. Concurrency Level — Higher concurrency enables better CPU utilization, up to ~65K concurrent workflows
  2. Allocator Choice — mimalloc provides ~24% improvement over the system allocator
  3. Node Complexity — Simple data routing is fast; heavy compute nodes dominate execution time
  4. Graph Depth — More sequential nodes mean more steps and longer execution
  5. Data Size — Large payloads increase serialization/deserialization overhead
  6. Tracing Level — Use LogLevel::Fatal for benchmarks; full tracing adds overhead

Both benchmarks execute a comparable task: a simple 2-node workflow. For a fair comparison, we use 4 threads to match the c5a.large (4 vCPU) setup from n8n’s published benchmarks:

| Platform | Setup | Throughput | vs n8n |
| --- | --- | --- | --- |
| Flow-Like | 4 threads, mimalloc | ~124,000 exec/sec | 564× faster |
| Flow-Like | 16 threads, mimalloc | ~244,000 exec/sec | 1,109× faster |
| n8n (single) | c5a.large (4 vCPU) | ~220 exec/sec | baseline |
| n8n (scaled) | 7× c5a.4xlarge | ~2,000 exec/sec | 9× baseline |
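The “vs n8n” column is simply the throughput ratio against the single-instance n8n baseline:

```rust
// Reproduce the "vs n8n" multipliers: speedup = throughput / baseline.
fn main() {
    let n8n_single = 220.0_f64;   // exec/s on c5a.large (4 vCPU)
    let flow_like_4t = 124_000.0; // exec/s, 4 threads + mimalloc
    let flow_like_16t = 244_000.0; // exec/s, 16 threads + mimalloc
    println!("4 threads:  ~{:.0}x faster", flow_like_4t / n8n_single);
    println!("16 threads: ~{:.0}x faster", flow_like_16t / n8n_single);
}
```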

| Platform | Execution Model | Typical Latency |
| --- | --- | --- |
| Flow-Like | Native Rust, typed | ~1-2ms per workflow |
| Node-based tools | JavaScript/Python | ~10-50ms per workflow |
| Cloud workflows | HTTP-based | ~100-500ms per workflow |

Found a performance issue or want to add a benchmark?

  1. Check existing benchmarks in packages/catalog/benches/
  2. Use Criterion for consistent measurement
  3. Document what you’re measuring and why
  4. Submit a PR with before/after results