Getting started

Installation

End users

pip install distributed-random-forest

Contributors

git clone https://github.com/Bowenislandsong/distributed_random_forest
cd distributed_random_forest
python -m pip install -e ".[dev,docs]"

A plain editable install (pip install -e .) also works; add optional extras only as you need them.

First federated run

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

from distributed_random_forest import FederatedRandomForest

X, y = make_classification(
    n_samples=1200,
    n_features=20,
    n_classes=3,
    n_informative=10,
    random_state=42,
)

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y,
)

model = FederatedRandomForest(
    n_clients=4,
    rf_params={"n_estimators": 24, "random_state": 42, "voting": "weighted"},
    partition_strategy="dirichlet",
    partition_kwargs={"alpha": 0.5},
    aggregation_strategy="auto",
    execution_backend="thread",
    max_workers=4,
    random_state=42,
)

model.fit(X_train, y_train)
metrics = model.evaluate(X_test, y_test)
print(model.selected_strategy)
print(metrics)
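
To build intuition for partition_strategy="dirichlet" and its alpha parameter, the following is a minimal standard-library sketch of the usual label-skew recipe: for each class, draw client proportions from a symmetric Dirichlet distribution and split that class's samples accordingly. It is an illustration only. The function name dirichlet_partition and its arguments are hypothetical, and the library's own partitioner may differ in its details.

```python
import random
from collections import Counter


def dirichlet_partition(labels, n_clients, alpha, seed=0):
    """Split sample indices across clients with per-class Dirichlet proportions.

    Hypothetical helper for illustration; not the library's implementation.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    clients = [[] for _ in range(n_clients)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # A symmetric Dirichlet sample is a set of normalized Gamma(alpha, 1) draws.
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(n_clients)]
        total = sum(draws)
        # Convert the proportions into cut points over this class's indices.
        start = 0
        cumulative = 0.0
        for client_id, draw in enumerate(draws):
            cumulative += draw / total
            if client_id == n_clients - 1:
                end = len(indices)  # the last client takes the remainder
            else:
                end = round(cumulative * len(indices))
            clients[client_id].extend(indices[start:end])
            start = end
    return clients


# Three balanced classes, split across four clients as in the example above.
labels = [i % 3 for i in range(1200)]
parts = dirichlet_partition(labels, n_clients=4, alpha=0.5, seed=42)
for client_id, part in enumerate(parts):
    print(client_id, len(part), dict(Counter(labels[i] for i in part)))
```

Smaller alpha values concentrate each class on fewer clients (stronger non-IID skew), while large values approach an even split.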

Build the documentation

pip install -e ".[docs]"
mkdocs serve    # local preview
# mkdocs build  # static site in ./site

Run experiment scripts

| Stage | Command |
| --- | --- |
| EXP 1: hyperparameters | python run_exp1_hparams.py |
| EXP 2: per-client RFs | python run_exp2_clients.py |
| EXP 3: federation | python run_exp3_federation.py |
| EXP 4: DP federation | python run_exp4_dp_federation.py |
| UCI example (accuracy & latency) | python examples/benchmark_public_dataset.py (use --quick for a short run) |

Local quality checks

If the repo includes a Makefile:

make test
make lint
make docs
make build

Without make:

python -m pytest tests -q
python -m ruff check .
python -m mkdocs build --strict
python -m build

Run tests

pytest tests/ -v

With coverage of the distributed_random_forest package:

pytest tests/ -v --cov=distributed_random_forest

Targeted suites:

| File | Focus |
| --- | --- |
| tests/test_tree_utils.py | Utilities and tree metrics |
| tests/test_random_forest.py | Core RF |
| tests/test_dp_rf.py | DP random forest |
| tests/test_voting.py | Voting |
| tests/test_aggregator.py | Aggregation |
| tests/test_e2e.py | End-to-end (synthetic) |
| tests/test_e2e_public_dataset.py | End-to-end (UCI breast cancer) |
| tests/test_datasets.py | Public dataset loader |
| tests/test_performance.py | Accuracy / latency bounds (marked performance) |
| tests/test_examples_run.py | Example script smoke test |
| tests/test_parallel_e2e.py | E2E: n_jobs=1 vs -1 parity (federated, EXP3) |
| tests/test_parallel_stress.py | Stress: many clients/trees, ranking load |
| tests/test_parallelism.py | resolve_n_jobs |

Differential privacy

Differential privacy is optional. The built-in DP mode works without extra packages:

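from distributed_random_forest import FederatedRandomForest

model = FederatedRandomForest(
    n_clients=5,
    rf_params={"n_estimators": 16, "random_state": 7},
    use_differential_privacy=True,  # turn on the built-in DP mode
    epsilon=2.0,  # privacy budget: smaller values mean stronger privacy and more noise
)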
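
This page does not say which mechanism the built-in DP mode uses, so the sketch below should not be read as the library's implementation. It only illustrates how an epsilon budget typically translates into noise, using the standard Laplace mechanism on a set of class counts. The helper names, the example counts, and the sensitivity of 1 (one record changes one count by one) are all assumptions made for the illustration.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_class_counts(counts, epsilon, rng, sensitivity=1.0):
    """Add Laplace noise with scale sensitivity / epsilon to each class count."""
    scale = sensitivity / epsilon
    return {label: count + laplace_noise(scale, rng) for label, count in counts.items()}


rng = random.Random(7)
leaf_counts = {"a": 40, "b": 25, "c": 5}  # hypothetical class counts
for epsilon in (0.5, 2.0, 8.0):
    noisy = noisy_class_counts(leaf_counts, epsilon, rng)
    print(epsilon, {label: round(value, 2) for label, value in noisy.items()})
```

The noise scale is inversely proportional to epsilon, so the counts at epsilon=0.5 are perturbed roughly four times as much as at epsilon=2.0.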

To also install the optional privacy tooling:

python -m pip install -e ".[privacy]"

Reports

Every orchestrated run can export a JSON report:

model.export_report("artifacts/federated-run.json")

That report includes:

  • client sample counts and training metrics
  • partition summaries
  • evaluated aggregation strategies
  • validation and final test metrics
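
Because the exact report schema is not documented here, a cautious first step is to inspect an exported file without assuming any key names. The sketch below uses only the standard library; summarize_report is a hypothetical helper, not part of the package, and the path is the one used in the export example above.

```python
import json
from pathlib import Path


def summarize_report(path):
    """Return {top-level key: type name} for a JSON report, assuming no schema."""
    report = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(report, dict):
        return {"<root>": type(report).__name__}
    return {key: type(value).__name__ for key, value in report.items()}


report_path = Path("artifacts/federated-run.json")
if report_path.exists():  # skip quietly if no report has been exported yet
    for key, kind in summarize_report(report_path).items():
        print(f"{key}: {kind}")
```

Once the real key names are known, the same json.loads result can be indexed directly to pull out client counts or metrics.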

Next steps