Distributed Random Forest

A Python toolkit for teams that want more than a notebook demo: federated training, parallel orchestration, non-IID partitioning, aggregation strategy search, optional differential privacy, structured run reports, and CI/CD-ready packaging.

Federated and distributed Random Forest training with optional differential privacy, inspired by work on Random Forest with Differential Privacy in Federated Learning for Network Attack Detection.

Why This Project Exists¶

Most "distributed random forest" repos stop at "train a few local forests and concatenate trees." This project goes further:

Real federated orchestration Parallel client training with sequential, thread, or process backends.

Non-IID simulation Uniform, stratified, feature-based, Dirichlet, sized, and label-skew partitioning.

Aggregation strategy search Compare classic paper baselines with balanced and proportional selectors.

Enterprise reporting Export structured JSON reports for audits, benchmarks, and dashboards.

Maintainer¶

This project is maintained by Bowen Song, an AI Scientist and PhD candidate at USC Viterbi working across health AI, federated learning, explainable AI, and scalable machine-learning systems. The project’s positioning is intentionally research-aware but package-first: it should be useful for reproducible experiments, demos, and real engineering evaluation.

Personal site: bowenislandsong.github.io/#/personal
CV: Bowen_Song_Resume.pdf
ORCID: 0000-0002-5071-3880

What You Can Build¶

Privacy-sensitive security analytics where data must remain on client sites.
Multi-region fraud, risk, or IoT classifiers with heterogeneous data quality.
Research benchmarks for tree-selection strategies under non-IID client splits.
CI-verified Python packages that ship clean docs, examples, and release workflows.

Why This Implementation Stands Out¶

Area	Baseline script repo	This project
Federated workflow	ad hoc experiments	reusable orchestration API
Heterogeneity	mostly uniform data splits	uniform, stratified, feature, sized, Dirichlet, label skew
Aggregation	one or two selection rules	classic paper rules plus balanced, proportional, threshold, and auto search
Operational maturity	code only	package, CLI, CI, docs, GitHub Pages, release workflow

Performance Snapshot¶

Single-run local benchmark on a synthetic 4-class dataset. Reproduce with python examples/performance_benchmark.py (when the script is present in the repo).

Scenario	Accuracy	Time (s)	Strategy
Centralized RF	0.8642	0.35	n/a
Federated uniform	0.7842	1.44	`proportional_weighted_accuracy`
Federated dirichlet	0.7642	1.42	`proportional_weighted_accuracy`
Federated dirichlet + DP	0.5125	0.74	`top_k_global_balanced_accuracy`

Guides¶

Core concepts — splits, voting, tree aggregation, and metrics.
Patterns (parallel, DP, sharding) — n_jobs, data partitioning, aggregation strategies, and DP layout.
Experiment pipeline — how EXP 1–4 are organized.
Getting started — install, run experiments, and tests.
Code examples — copy-paste training and federated patterns.

Quick install¶

pip install distributed-random-forest

Or from source:

git clone https://github.com/Bowenislandsong/distributed_random_forest
cd distributed_random_forest
python -m pip install -e ".[dev,docs]"

Fastest way to try it¶

drf-quickstart --clients 4 --partition-strategy dirichlet --backend thread

Build this documentation locally¶

pip install -e ".[docs]"
mkdocs serve

Then open the URL printed in the terminal (usually http://127.0.0.1:8000).

Core building blocks¶

RandomForest and DPRandomForest for local training.
ClientRF and DPClientRF for client-scoped model ownership and metrics.
FederatedAggregator for explicit tree selection strategies.
FederatedRandomForest for end-to-end orchestration, validation, and reporting.

Next steps¶

Follow Getting started for installation and a first training run.
Read Distributed strategies & DP to choose partitioning and aggregation.
Use Operations to enable CI, release publishing, and GitHub Pages deployment.

Links¶

PyPI: distributed-random-forest
Source: GitHub