# Distributed Random Forest
A Python toolkit for teams that want more than a notebook demo: federated training, parallel orchestration, non-IID partitioning, aggregation strategy search, optional differential privacy, structured run reports, and CI/CD-ready packaging.
## Why This Project Exists

Most "distributed random forest" repos stop at "train a few local forests and concatenate trees." This project goes further, layering non-IID partitioning, aggregation strategy search, optional differential privacy, and CI/CD-ready packaging on top of that baseline.
## Maintainer
This project is maintained by Bowen Song, an AI Scientist and PhD candidate at USC Viterbi working across health AI, federated learning, explainable AI, and scalable machine-learning systems. The project is intentionally research-aware but package-first: it aims to support reproducible experiments, demos, and real engineering evaluation.
- Personal site: bowenislandsong.github.io/#/personal
- CV: Bowen_Song_Resume.pdf
- ORCID: 0000-0002-5071-3880
## What You Can Build
- Privacy-sensitive security analytics where data must remain on client sites.
- Multi-region fraud, risk, or IoT classifiers with heterogeneous data quality.
- Research benchmarks for tree-selection strategies under non-IID client splits.
- CI-verified Python packages that ship clean docs, examples, and release workflows.
## Why This Implementation Stands Out
| Area | Baseline script repo | This project |
|---|---|---|
| Federated workflow | ad hoc experiments | reusable orchestration API |
| Heterogeneity | mostly uniform data splits | uniform, stratified, feature, sized, Dirichlet, label skew |
| Aggregation | one or two selection rules | classic paper rules plus balanced, proportional, threshold, and auto search |
| Operational maturity | code only | package, CLI, CI, docs, GitHub Pages, release workflow |
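Dirichlet partitioning is the standard way to simulate the label skew listed above. As a minimal sketch of the idea (not the package's implementation; the function name and signature here are illustrative), each class's samples are split across clients in proportions drawn from a Dirichlet distribution, where a smaller `alpha` yields more heterogeneous clients:

```python
import numpy as np

def dirichlet_partition(labels, n_clients, alpha, seed=0):
    """Split sample indices across clients with Dirichlet label skew.

    Lower alpha -> more skewed (non-IID) per-client label mixes.
    """
    rng = np.random.default_rng(seed)
    client_indices = [[] for _ in range(n_clients)]
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        # Per-client share of this class, drawn from Dirichlet(alpha, ..., alpha).
        proportions = rng.dirichlet(alpha * np.ones(n_clients))
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(part.tolist())
    return [np.array(ix) for ix in client_indices]

# Toy balanced 4-class label vector, split across 4 skewed clients.
labels = np.repeat([0, 1, 2, 3], 250)
parts = dirichlet_partition(labels, n_clients=4, alpha=0.3)
```

With `alpha=0.3` most clients end up dominated by one or two classes; `alpha=100` approaches a uniform split.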
## Performance Snapshot

Single-run local benchmark on a synthetic 4-class dataset. Reproduce with `python examples/performance_benchmark.py`.
| Scenario | Accuracy | Time (s) | Strategy |
|---|---|---|---|
| Centralized RF | 0.8642 | 0.35 | n/a |
| Federated uniform | 0.7842 | 1.44 | proportional_weighted_accuracy |
| Federated dirichlet | 0.7642 | 1.42 | proportional_weighted_accuracy |
| Federated dirichlet + DP | 0.5125 | 0.74 | top_k_global_balanced_accuracy |
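The gap between the centralized and federated rows reflects the cost of training on fragmented data. A minimal sketch of the comparison, assuming scikit-learn (this is not the benchmark script; swapping a forest's `estimators_` is a common tree-pooling trick, used here only to illustrate the naive "concatenate trees" baseline):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_classes=4, n_informative=8,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Centralized baseline: one forest over all training data.
central = RandomForestClassifier(n_estimators=40, random_state=0).fit(X_tr, y_tr)

# "Federated" baseline: 4 clients train locally, then trees are pooled.
client_splits = np.array_split(np.random.default_rng(0).permutation(len(X_tr)), 4)
local = [RandomForestClassifier(n_estimators=10, random_state=i).fit(X_tr[ix], y_tr[ix])
         for i, ix in enumerate(client_splits)]

# Shell forest fitted once so classes_ are set, then trees replaced.
merged = RandomForestClassifier(n_estimators=1, random_state=0).fit(X_tr, y_tr)
merged.estimators_ = [tree for forest in local for tree in forest.estimators_]
merged.n_estimators = len(merged.estimators_)

acc_central = central.score(X_te, y_te)
acc_merged = merged.score(X_te, y_te)
```

The merged forest typically trails the centralized one, which is what motivates the selection strategies in the table above rather than blind concatenation.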
## Quick Install

```bash
pip install distributed-random-forest
```

Or from source:

```bash
git clone https://github.com/Bowenislandsong/distributed_random_forest
cd distributed_random_forest
python -m pip install -e ".[dev,docs]"
```
## Fastest Way To Try It

```bash
drf-quickstart --clients 4 --partition-strategy dirichlet --backend thread
```
## Core Building Blocks

- `RandomForest` and `DPRandomForest` for local training.
- `ClientRF` and `DPClientRF` for client-scoped model ownership and metrics.
- `FederatedAggregator` for explicit tree-selection strategies.
- `FederatedRandomForest` for end-to-end orchestration, validation, and reporting.
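To make the aggregator's role concrete, here is a toy stand-in for a tree-selection rule in the spirit of `top_k_global_balanced_accuracy` (the function below is hypothetical and not the package API; it assumes scikit-learn forests). It scores every client tree on a shared validation set and keeps the top k:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def top_k_trees_by_accuracy(client_forests, X_val, y_val, k):
    """Illustrative selection rule: keep the k client trees with the
    best individual accuracy on a shared validation set."""
    scored = []
    for forest in client_forests:
        for tree in forest.estimators_:
            scored.append((tree.score(X_val, y_val), tree))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [tree for _, tree in scored[:k]]

X, y = make_classification(n_samples=1200, n_classes=3, n_informative=6,
                           random_state=1)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=1)

# Three "clients", each training a small local forest on its shard.
splits = np.array_split(np.arange(len(X_tr)), 3)
forests = [RandomForestClassifier(n_estimators=8, random_state=i).fit(X_tr[ix], y_tr[ix])
           for i, ix in enumerate(splits)]
best = top_k_trees_by_accuracy(forests, X_val, y_val, k=10)
```

The real aggregator layers weighting, thresholds, and auto search on top of this kind of per-tree scoring.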
## Next Steps
- Follow Getting Started for installation and a first training run.
- Read Distributed Strategies to choose partitioning and aggregation.
- Use Operations to enable CI, release publishing, and GitHub Pages deployment.