Machine Learning

Flow-Like includes a complete machine learning toolkit built on linfa (Rust’s ML library) and ONNX Runtime for neural network inference. Train models visually without writing code.

Classification:

Algorithm           Node                    Best For
Decision Tree       Fit Decision Tree       Interpretable rules, multi-class
Naive Bayes         Fit Naive Bayes         Fast baseline, Gaussian features
SVM                 Fit SVM Multi-Class     High accuracy, complex boundaries

Regression:

Algorithm           Node                    Best For
Linear Regression   Fit Linear Regression   Continuous predictions, feature importance

Clustering:

Algorithm           Node                    Best For
K-Means             Fit KMeans              Known cluster count, spherical clusters
DBSCAN              Fit DBSCAN              Unknown cluster count, outlier detection

Dimensionality Reduction:

Algorithm           Node                    Best For
PCA                 Fit PCA                 Feature reduction, visualization prep

Neural Networks:

Model Type             Node                 Best For
Image Classification   ONNX TIMM            Classify images
Object Detection       ONNX YOLO/D-FINE     Detect objects in images
Teachable Machine      Teachable Machine    Quick prototyping

ML nodes expect data in a LanceDB database with:

  • A records column: 2D float array (feature matrix)
  • A targets column: labels (classification) or values (regression)
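For orientation, here is a minimal sketch of that layout using the LanceDB Python client (the database path and table name are placeholders; in Flow-Like these tables are created through nodes, not code):

import lancedb

# Connect to a local LanceDB database (path is illustrative).
db = lancedb.connect("data/ml_demo")

# Each row pairs a feature vector ("records") with a label ("targets"),
# matching the column layout the ML nodes expect.
rows = [
    {"records": [5.1, 3.5, 1.4, 0.2], "targets": "setosa"},
    {"records": [6.7, 3.0, 5.2, 2.3], "targets": "virginica"},
]
table = db.create_table("training_data", data=rows)
print(table.count_rows())  # 2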
A typical training pipeline:

1. Load Data (CSV, SQL, etc.)
   │
   ▼
2. Insert into Database
   │
   ▼
3. Format as records/targets
   │
   ▼
4. Split (train/test)
   │
   ▼
5. Train Model

Random Split:

Split Dataset
├── Database: (input data)
├── Split Ratio: 0.8 (80% train, 20% test)
├── Train ──▶ (training database)
└── Test ──▶ (test database)

Stratified Split (preserves class distribution):

Stratified Split
├── Database: (input data)
├── Target Column: "label"
├── Split Ratio: 0.8
├── Train ──▶ (balanced training set)
└── Test ──▶ (balanced test set)
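Both splits behave like scikit-learn's train_test_split; a rough Python sketch of the difference (illustrative only, not Flow-Like's implementation):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Random split: 80% train, 20% test.
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)

# Stratified split: class proportions in y are preserved on both sides.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, stratify=y
)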
Node              Purpose
Shuffle Dataset   Randomize row order
Sample Dataset    Take a random subset

Decision trees create interpretable if-then rules:

Fit Decision Tree
├── Database: (training data)
├── Max Depth: 10 (0 = unlimited)
├── Min Samples Split: 2
└── Model ──▶ (trained decision tree)

When to use:

  • You need to explain predictions
  • Data has clear decision boundaries
  • Multi-class classification

Parameters:

Parameter           Effect                     Recommendation
Max Depth           Tree complexity            Start with 5-10, increase if underfitting
Min Samples Split   Minimum samples to split   Higher values prevent overfitting
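Conceptually, the node behaves like scikit-learn's DecisionTreeClassifier (a sketch for orientation; Flow-Like itself trains via linfa):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, stratify=y
)

# Max Depth and Min Samples Split mirror the node's parameters.
tree = DecisionTreeClassifier(max_depth=10, min_samples_split=2)
tree.fit(X_train, y_train)

print(export_text(tree))                        # the learned if-then rules
print("accuracy:", tree.score(X_test, y_test))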

Fast Gaussian classifier:

Fit Naive Bayes
├── Database: (training data)
└── Model ──▶ (trained Naive Bayes)

When to use:

  • Quick baseline model
  • Features are roughly Gaussian
  • Fast inference needed

Pros/Cons:

Pros                            Cons
Very fast training              Assumes feature independence
Works with small datasets       Less accurate than trees/SVM
Handles multi-class naturally   Sensitive to feature scaling
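The equivalent Gaussian Naive Bayes fit, sketched in scikit-learn terms (illustrative only):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, stratify=y
)

# No hyperparameters required: a fast baseline in two lines.
nb = GaussianNB().fit(X_train, y_train)
print("accuracy:", nb.score(X_test, y_test))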

High-accuracy classifier with RBF kernel:

Fit SVM Multi-Class
├── Database: (training data)
└── Model ──▶ (trained SVM ensemble)

When to use:

  • Maximum accuracy needed
  • Smaller datasets (< 10,000 samples)
  • Complex decision boundaries

Notes:

  • Uses One-vs-All strategy for multi-class
  • Gaussian (RBF) kernel by default
  • Slower training than trees/Naive Bayes
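A sketch of the same strategy in scikit-learn: an RBF-kernel SVM wrapped in a one-vs-rest (One-vs-All) ensemble (illustrative only, not Flow-Like's implementation):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, stratify=y
)

# One binary RBF SVM per class, as in the node's One-vs-All strategy.
svm = OneVsRestClassifier(SVC(kernel="rbf")).fit(X_train, y_train)
print("accuracy:", svm.score(X_test, y_test))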

Predict continuous values:

Fit Linear Regression
├── Database: (training data with numeric targets)
└── Model ──▶ (trained linear model)

When to use:

  • Predicting continuous values
  • Understanding feature importance
  • Linear relationship expected

Getting Coefficients:

Get Linear Coefficients
├── Model: (trained linear regression)
└── Info ──▶ {
      coefficients: [0.5, -0.3, 0.8],
      intercept: 2.1,
      n_features: 3
    }
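The coefficients are what any ordinary least-squares fit recovers; a sketch reproducing the example numbers above on synthetic data (values chosen to match):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.random((100, 3))
y = 0.5 * X[:, 0] - 0.3 * X[:, 1] + 0.8 * X[:, 2] + 2.1

model = LinearRegression().fit(X, y)
print(model.coef_)       # [ 0.5 -0.3  0.8]
print(model.intercept_)  # 2.1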

Partition data into k clusters:

Fit KMeans
├── Database: (data with records column)
├── Clusters: 5 (number of clusters)
└── Model ──▶ (trained KMeans)

When to use:

  • You know the number of clusters
  • Clusters are roughly spherical
  • Customer segmentation, grouping

Getting Centroids:

Get KMeans Centroids
├── Model: (trained KMeans)
└── Info ──▶ {
      k: 5,
      dimensions: 3,
      centroids: [[...], [...], ...]
    }
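In scikit-learn terms (illustrative), the fit and the centroid lookup are:

import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).random((200, 3))

kmeans = KMeans(n_clusters=5).fit(X)    # the Clusters parameter
print(kmeans.cluster_centers_.shape)    # (k=5, dimensions=3)
print(kmeans.labels_[:10])              # cluster index per row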

Density-based clustering:

Fit DBSCAN
├── Database: (data with records column)
├── Epsilon: 0.5 (max distance between points)
├── Min Points: 5 (points to form dense region)
├── End ──▶ (clustering complete)
├── N Clusters ──▶ (number found)
└── N Noise ──▶ (outliers found)

When to use:

  • Unknown number of clusters
  • Need to detect outliers/anomalies
  • Non-spherical cluster shapes
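A sketch of the same outputs with scikit-learn's DBSCAN, where eps and min_samples correspond to Epsilon and Min Points (illustrative only):

import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.default_rng(0).random((200, 3))
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# Label -1 marks noise points; everything else is a cluster index.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(n_clusters, n_noise)              # the N Clusters / N Noise outputs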

Reduce feature dimensions:

Fit PCA
├── Database: (high-dimensional data)
├── N Components: 2 (target dimensions)
├── Output Column: "reduced"
├── End ──▶ (reduction complete)
└── Vectors ──▶ (reduced vectors)

When to use:

  • Too many features (high-dimensional data)
  • Preparing for visualization (reduce to 2-3D)
  • Removing noise/redundant features
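Conceptually the node performs a standard PCA projection; a scikit-learn sketch:

import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(0).random((200, 10))  # 10-dimensional input

pca = PCA(n_components=2)                       # the N Components parameter
reduced = pca.fit_transform(X)                  # the "reduced" vectors
print(reduced.shape)                            # (200, 2)
print(pca.explained_variance_ratio_)            # variance kept per component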

The Predict node works with any trained model:

Predict
├── Model: (any trained ML model)
├── Mode: "Database"
├── Database: (data to predict)
├── Input Column: "records"
├── Output Column: "predictions"
├── Batch Size: 5000
├── End ──▶ (predictions complete)
└── Database ──▶ (with predictions column)

For single predictions:

Predict
├── Model: (trained model)
├── Mode: "Vector"
├── Vector: [1.5, 2.3, 0.8, ...] (features)
└── Prediction ──▶ "class_a" (or numeric value)
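Both modes behave like ordinary model.predict calls; a sketch of batched (Database) and single-vector prediction (illustrative only):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(max_depth=5).fit(X, y)

# Database mode: predict in batches and collect a predictions column.
batch_size = 5000  # mirrors the node's Batch Size parameter
predictions = np.concatenate(
    [model.predict(X[i:i + batch_size]) for i in range(0, len(X), batch_size)]
)

# Vector mode: one feature vector in, one prediction out.
single = model.predict([[5.1, 3.5, 1.4, 0.2]])[0]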

Accuracy:

Evaluate Accuracy
├── Database: (with predictions & targets)
├── Prediction Column: "predictions"
├── Target Column: "targets"
└── Result ──▶ {
      accuracy: 0.92,
      correct: 920,
      total: 1000
    }

Confusion Matrix:

Evaluate Confusion Matrix
├── Database: (with predictions & targets)
├── Prediction Column: "predictions"
├── Target Column: "targets"
└── Result ──▶ {
      matrix: [[45, 5], [3, 47]],
      precision: [0.94, 0.90],
      recall: [0.90, 0.94],
      f1_score: [0.92, 0.92]
    }
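These are the standard classification metrics; a sketch of how they relate, via scikit-learn (illustrative only):

from sklearn.metrics import (
    accuracy_score, confusion_matrix, precision_recall_fscore_support,
)

y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]

print(accuracy_score(y_true, y_pred))     # fraction correct
print(confusion_matrix(y_true, y_pred))   # rows = true, cols = predicted
p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred)
print(p, r, f1)                           # one value per class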
Regression Metrics:

Evaluate Regression
├── Database: (with predictions & targets)
├── Prediction Column: "predictions"
├── Target Column: "targets"
└── Result ──▶ {
      mse: 0.05,
      rmse: 0.22,
      mae: 0.18,
      r_squared: 0.89
    }

Metric Guide:

Metric   Description                       Good Value
MSE      Mean Squared Error                Lower is better
RMSE     Root MSE (same units as target)   Lower is better
MAE      Mean Absolute Error               Lower is better
R²       Variance explained                Closer to 1.0
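How the four regression metrics are computed (illustrative sketch):

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([2.0, 3.1, 4.2, 5.0])
y_pred = np.array([2.2, 3.0, 4.0, 5.1])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                     # back in the target's units
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)           # 1.0 = perfect fit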
Save trained models for later reuse:

Save ML Model
├── Model: (trained model)
├── Path: (FlowPath for output)
└── End

Formats:

  • JSON – Human-readable, portable
  • Binary – Faster, smaller (Fory format)
Load a saved model:

Load ML Model
├── Path: (FlowPath to saved model)
└── Model ──▶ (loaded model ready for predictions)

For pre-trained neural networks:

Load ONNX
├── Path: (FlowPath to .onnx file)
└── Session ──▶ (ONNX inference session)

Use models exported from PyTorch Image Models (TIMM):

ONNX Classification
├── Session: (ONNX session)
├── Image: (image data)
├── Top K: 5
└── Results ──▶ [
      {class_idx: 281, score: 0.92},
      {class_idx: 282, score: 0.05},
      ...
    ]
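A bare-bones onnxruntime sketch of the same top-K classification (the model path is a placeholder, and real TIMM models need their own resize/normalize preprocessing):

import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "model.onnx", providers=["CPUExecutionProvider"]  # placeholder path
)

# TIMM classifiers typically take a normalized NCHW float32 tensor.
image = np.random.rand(1, 3, 224, 224).astype(np.float32)

input_name = session.get_inputs()[0].name
logits = session.run(None, {input_name: image})[0][0]

# Softmax, then the Top K class indices.
probs = np.exp(logits - logits.max())
probs /= probs.sum()
for idx in np.argsort(probs)[::-1][:5]:
    print({"class_idx": int(idx), "score": float(probs[idx])})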

Detect objects in images:

ONNX Detection
├── Session: (ONNX session)
├── Image: (image data)
├── Confidence: 0.5
├── NMS Threshold: 0.4
└── Detections ──▶ [
      {class_idx: 0, score: 0.95, x1: 10, y1: 20, x2: 100, y2: 150},
      ...
    ]

For Google Teachable Machine models:

Teachable Machine
├── Path: (FlowPath to .tflite)
├── Labels: (optional labels file)
├── Image: (image data)
└── Results ──▶ [{label: "cat", score: 0.95}, ...]
Choosing a model:

Use Case                        Recommended Model
Quick classification baseline   Naive Bayes
Need to explain predictions     Decision Tree
Maximum accuracy (small data)   SVM
Predict continuous values       Linear Regression
Group data (known K)            K-Means
Find outliers & groups          DBSCAN
Reduce dimensions               PCA
Classify images                 ONNX (TIMM)
Detect objects                  ONNX (YOLO)
Imbalanced classes              Use Stratified Split first

Complete Example: Customer Churn Prediction

Load CSV (customer data)
  │
  ▼
Insert to Database
  │
  ▼
Stratified Split (80/20)
  │
  ├──▶ Train Set ──▶ Fit Decision Tree ──▶ Model
  │                                          │
  └──▶ Test Set ─────────────────────────────┤
                                             │
                                             ▼
                                          Predict
                                             │
                                             ▼
                                     Confusion Matrix
                                             │
                                             ▼
                                    Save Model (if good)
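The same pipeline in rough scikit-learn form (the CSV name and "churned" column are hypothetical; only the shape of the flow matters):

import joblib
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("customers.csv")                  # hypothetical file
X, y = df.drop(columns=["churned"]), df["churned"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, stratify=y               # stratified 80/20 split
)

model = DecisionTreeClassifier(max_depth=10).fit(X_train, y_train)
print(confusion_matrix(y_test, model.predict(X_test)))

joblib.dump(model, "churn_model.joblib")           # save if metrics look good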

Best Practices

1. Split Before You Evaluate

Never evaluate on training data; it gives overly optimistic results.

2. Start Simple

Begin with Naive Bayes or Decision Trees, then try more complex models.

3. Use Stratified Splitting for Classification

Especially important when classes are imbalanced.

4. Scale Your Features

Some algorithms (SVM, K-Means) are sensitive to feature scales. Consider normalizing.

5. Look Beyond Accuracy

Accuracy alone can be misleading. Check precision, recall, and F1.

6. Save and Reuse Models

Don't retrain every time; save and load trained models.

Troubleshooting

Low accuracy:

  • Check for data quality issues
  • Try a different algorithm
  • Increase training data
  • Check for class imbalance

Slow training or high memory use:

  • Reduce dataset size with sampling
  • Use smaller batch sizes
  • Try simpler algorithms (Naive Bayes)

Very large datasets:

  • Set MAX_RECORDS limit
  • Process in batches
  • Use sampling for very large datasets

With trained models in hand, save them for reuse and connect them to the Predict node in your flows.