Getting started

Installation

End users

pip install distributed-random-forest

Contributors

git clone https://github.com/Bowenislandsong/distributed_random_forest
cd distributed_random_forest
python -m pip install -e ".[dev,docs]"

A plain editable install with pip install -e . also works; add optional extras as needed.

First federated run

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

from distributed_random_forest import FederatedRandomForest

X, y = make_classification(
    n_samples=1200,
    n_features=20,
    n_classes=3,
    n_informative=10,
    random_state=42,
)

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y,
)

model = FederatedRandomForest(
    n_clients=4,
    rf_params={"n_estimators": 24, "random_state": 42, "voting": "weighted"},
    partition_strategy="dirichlet",
    partition_kwargs={"alpha": 0.5},
    aggregation_strategy="auto",
    execution_backend="thread",
    max_workers=4,
    random_state=42,
)

model.fit(X_train, y_train)
metrics = model.evaluate(X_test, y_test)
print(model.selected_strategy)
print(metrics)
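
To give a feel for what partition_strategy="dirichlet" with alpha=0.5 implies, here is a minimal sketch of a commonly used label-skew Dirichlet split. It is an illustration only: the function name dirichlet_partition and its arguments are hypothetical, and the library's own partitioner may differ in its details.

```python
import numpy as np


def dirichlet_partition(y, n_clients, alpha, seed=0):
    """Split sample indices across clients with Dirichlet-distributed label skew.

    Hypothetical helper for illustration; not the library's implementation.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    client_indices = [[] for _ in range(n_clients)]
    for label in np.unique(y):
        # Shuffle the indices that belong to this class.
        idx = rng.permutation(np.flatnonzero(y == label))
        # Draw the share of this class that each client receives.
        shares = rng.dirichlet(alpha * np.ones(n_clients))
        # Turn cumulative shares into split points within the shuffled indices.
        cuts = (np.cumsum(shares)[:-1] * len(idx)).astype(int)
        for client, chunk in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(chunk.tolist())
    return [np.array(sorted(ix), dtype=int) for ix in client_indices]


y_demo = np.repeat([0, 1, 2], 320)  # 960 labels, three balanced classes
parts = dirichlet_partition(y_demo, n_clients=4, alpha=0.5, seed=42)
for client, ix in enumerate(parts):
    # Per-client class histogram; skewed counts indicate non-IID data.
    print(client, np.bincount(y_demo[ix], minlength=3))
```

Smaller alpha values concentrate each class on fewer clients, while larger values approach an even split, so alpha=0.5 simulates fairly heterogeneous clients.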

Build the documentation

pip install -e ".[docs]"
mkdocs serve    # local preview
# mkdocs build  # static site in ./site

Run experiment scripts

Stage	Command
EXP 1 — hyperparameters	`python run_exp1_hparams.py`
EXP 2 — per-client RFs	`python run_exp2_clients.py`
EXP 3 — federation	`python run_exp3_federation.py`
EXP 4 — DP federation	`python run_exp4_dp_federation.py`
UCI example (accuracy & latency)	`python examples/benchmark_public_dataset.py` — use `--quick` for a short run

Local quality checks

If the repo includes a Makefile:

make test
make lint
make docs
make build

Without make:

python -m pytest tests -q
python -m ruff check .
python -m mkdocs build --strict
python -m build

Run tests

pytest tests/ -v

To measure coverage of the distributed_random_forest package:

pytest tests/ -v --cov=distributed_random_forest

Targeted suites:

File	Focus
`tests/test_tree_utils.py`	Utilities and tree metrics
`tests/test_random_forest.py`	Core RF
`tests/test_dp_rf.py`	DP random forest
`tests/test_voting.py`	Voting
`tests/test_aggregator.py`	Aggregation
`tests/test_e2e.py`	End-to-end (synthetic)
`tests/test_e2e_public_dataset.py`	End-to-end (UCI breast cancer)
`tests/test_datasets.py`	Public dataset loader
`tests/test_performance.py`	Accuracy / latency bounds (marked `performance`)
`tests/test_examples_run.py`	Example script smoke test
`tests/test_parallel_e2e.py`	E2E: `n_jobs=1` vs `n_jobs=-1` parity (federated, EXP3)
`tests/test_parallel_stress.py`	Stress: many clients/trees, ranking load
`tests/test_parallelism.py`	`resolve_n_jobs`

Differential privacy

Differential privacy is optional. The built-in DP mode works without extra packages:

model = FederatedRandomForest(
    n_clients=5,
    rf_params={"n_estimators": 16, "random_state": 7},
    use_differential_privacy=True,
    epsilon=2.0,
)
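
This page does not describe which mechanism the built-in DP mode uses. As a generic illustration of what the epsilon parameter controls, the sketch below applies the standard Laplace mechanism to a vector of counts. The function laplace_noisy_counts and the example counts are hypothetical and are not part of the library.

```python
import numpy as np


def laplace_noisy_counts(counts, epsilon, sensitivity=1.0, seed=0):
    """Add Laplace noise with scale sensitivity / epsilon to each count.

    Hypothetical helper showing the textbook Laplace mechanism.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = np.random.default_rng(seed)
    scale = sensitivity / epsilon  # smaller epsilon means more noise
    noise = rng.laplace(0.0, scale, size=len(counts))
    return np.asarray(counts, dtype=float) + noise


counts = [120, 45, 35]  # made-up class counts, for illustration only
for eps in (0.5, 2.0, 8.0):
    print(eps, np.round(laplace_noisy_counts(counts, eps, seed=7), 1))
```

Under this mechanism the noise scale is inversely proportional to epsilon, so epsilon=2.0 adds noise with scale 0.5 to a count of sensitivity 1. Lower values give stronger privacy at the cost of accuracy.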

To install the optional privacy tooling as well:

python -m pip install -e ".[privacy]"

Reports

Every orchestrated run can export a JSON report:

model.export_report("artifacts/federated-run.json")

That report includes:

client sample counts and training metrics
partition summaries
evaluated aggregation strategies
validation and final test metrics
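
The report's exact schema is not given here, so the following sketch inspects an exported file without assuming any key names. The helper summarize_report is hypothetical, and the stand-in file uses made-up keys purely so the example runs without the library.

```python
import json
import tempfile
from pathlib import Path


def summarize_report(path):
    """Return each top-level key of a JSON report with the type of its value."""
    with open(path, encoding="utf-8") as fh:
        report = json.load(fh)
    if not isinstance(report, dict):
        raise ValueError("expected a JSON object at the top level")
    return {key: type(value).__name__ for key, value in report.items()}


# Stand-in report with invented keys; a real run would point at the file
# written by model.export_report(...).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "federated-run.json"
    demo_path.write_text(
        json.dumps({"clients": [], "metrics": {"accuracy": 0.0}}),
        encoding="utf-8",
    )
    summary = summarize_report(demo_path)

print(summary)
```

Listing keys and value types first is a cheap way to discover the report layout before writing code that depends on specific fields.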

Next steps

Supported distributed RF patterns for partitioning, aggregation, and DP layout.
Code examples for API usage.
Core concepts for design detail.