Scikit-learn 0.21 has been released! As previewed in the RC, this release adds HistGradientBoostingClassifier, a histogram-based gradient boosting algorithm; the OPTICS clustering algorithm; IterativeImputer, which fills in missing values by predicting them; and NeighborhoodComponentsAnalysis.
Version 0.21 can be installed with pip right away; the conda package will probably take another day or two.
$ pip install scikit-learn
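Just for reference, here is a minimal sketch I put together myself (the toy data and parameter values are arbitrary, not from the release notes) of the new OPTICS and NeighborhoodComponentsAnalysis estimators:
>>> import numpy as np
>>> from sklearn.cluster import OPTICS
>>> X = np.array([[1, 2], [2, 2], [2, 3],
...               [8, 7], [8, 8], [25, 80]])
>>> labels = OPTICS(min_samples=2).fit(X).labels_  # cluster labels; -1 marks noise
>>> from sklearn.neighbors import NeighborhoodComponentsAnalysis
>>> from sklearn.datasets import load_iris
>>> X_iris, y_iris = load_iris(return_X_y=True)
>>> nca = NeighborhoodComponentsAnalysis(n_components=2, random_state=42)
>>> X_embedded = nca.fit(X_iris, y_iris).transform(X_iris)  # supervised 2-D embedding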
Of these, HistGradientBoostingClassifier and IterativeImputer are experimental features and are not enabled by default. You first have to import the corresponding enabler from the sklearn.experimental module, like this:
>>> from sklearn.experimental import enable_hist_gradient_boosting
>>> from sklearn.ensemble import HistGradientBoostingClassifier
>>> from sklearn.experimental import enable_iterative_imputer
>>> from sklearn.impute import IterativeImputer
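Once the enabler is imported, IterativeImputer works like any other transformer. A minimal sketch of my own (the toy array below is just for illustration):
>>> import numpy as np
>>> from sklearn.experimental import enable_iterative_imputer  # noqa
>>> from sklearn.impute import IterativeImputer
>>> X = np.array([[1.0, 2.0], [3.0, 6.0], [4.0, 8.0],
...               [np.nan, 3.0], [7.0, np.nan]])
>>> imputer = IterativeImputer(max_iter=10, random_state=0)
>>> X_filled = imputer.fit_transform(X)  # np.nan entries replaced by regression predictions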
For some reason the documentation page for HistGradientBoostingClassifier wasn't generated, so for now I'm pasting its docstring straight from the source code. 🙂
"""Histogram-based Gradient Boosting Classification Tree.
This estimator is much faster than
:class:`GradientBoostingClassifier<sklearn.ensemble.GradientBoostingClassifier>`
for big datasets (n_samples >= 10 000). The input data ``X`` is pre-binned
into integer-valued bins, which considerably reduces the number of
splitting points to consider, and allows the algorithm to leverage
integer-based data structures. For small sample sizes,
:class:`GradientBoostingClassifier<sklearn.ensemble.GradientBoostingClassifier>`
might be preferred since binning may lead to split points that are too
approximate in this setting.
This implementation is inspired by
`LightGBM <https://github.com/Microsoft/LightGBM>`_.
.. note::
This estimator is still **experimental** for now: the predictions
and the API might change without any deprecation cycle. To use it,
you need to explicitly import ``enable_hist_gradient_boosting``::
>>> # explicitly require this experimental feature
>>> from sklearn.experimental import enable_hist_gradient_boosting # noqa
>>> # now you can import normally from ensemble
>>> from sklearn.ensemble import HistGradientBoostingClassifier
Parameters
----------
loss : {'auto', 'binary_crossentropy', 'categorical_crossentropy'}, \
optional (default='auto')
The loss function to use in the boosting process. 'binary_crossentropy'
(also known as logistic loss) is used for binary classification and
generalizes to 'categorical_crossentropy' for multiclass
classification. 'auto' will automatically choose either loss depending
on the nature of the problem.
learning_rate : float, optional (default=0.1)
The learning rate, also known as *shrinkage*. This is used as a
multiplicative factor for the leaves values. Use ``1`` for no
shrinkage.
max_iter : int, optional (default=100)
The maximum number of iterations of the boosting process, i.e. the
maximum number of trees for binary classification. For multiclass
classification, `n_classes` trees per iteration are built.
max_leaf_nodes : int or None, optional (default=31)
The maximum number of leaves for each tree. Must be strictly greater
than 1. If None, there is no maximum limit.
max_depth : int or None, optional (default=None)
The maximum depth of each tree. The depth of a tree is the number of
nodes to go from the root to the deepest leaf. Must be strictly greater
than 1. Depth isn't constrained by default.
min_samples_leaf : int, optional (default=20)
The minimum number of samples per leaf. For small datasets with less
than a few hundred samples, it is recommended to lower this value
since only very shallow trees would be built.
l2_regularization : float, optional (default=0)
The L2 regularization parameter. Use 0 for no regularization.
max_bins : int, optional (default=256)
The maximum number of bins to use. Before training, each feature of
the input array ``X`` is binned into at most ``max_bins`` bins, which
allows for a much faster training stage. Features with a small
number of unique values may use less than ``max_bins`` bins. Must be no
larger than 256.
scoring : str or callable or None, optional (default=None)
Scoring parameter to use for early stopping. It can be a single
string (see :ref:`scoring_parameter`) or a callable (see
:ref:`scoring`). If None, the estimator's default scorer
is used. If ``scoring='loss'``, early stopping is checked
w.r.t the loss value. Only used if ``n_iter_no_change`` is not None.
validation_fraction : int or float or None, optional (default=0.1)
Proportion (or absolute size) of training data to set aside as
validation data for early stopping. If None, early stopping is done on
the training data.
n_iter_no_change : int or None, optional (default=None)
Used to determine when to "early stop". The fitting process is
stopped when none of the last ``n_iter_no_change`` scores are better
than the ``n_iter_no_change - 1``th-to-last one, up to some
tolerance. If None or 0, no early-stopping is done.
tol : float or None, optional (default=1e-7)
The absolute tolerance to use when comparing scores. The higher the
tolerance, the more likely we are to early stop: higher tolerance
means that it will be harder for subsequent iterations to be
considered an improvement upon the reference score.
verbose : int, optional (default=0)
The verbosity level. If not zero, print some information about the
fitting process.
random_state : int, np.random.RandomStateInstance or None, \
optional (default=None)
Pseudo-random number generator to control the subsampling in the
binning process, and the train/validation data split if early stopping
is enabled. See :term:`random_state`.
Attributes
----------
n_iter_ : int
The number of estimators as selected by early stopping (if
n_iter_no_change is not None). Otherwise it corresponds to max_iter.
n_trees_per_iteration_ : int
The number of trees that are built at each iteration. This is equal to 1
for binary classification, and to ``n_classes`` for multiclass
classification.
train_score_ : ndarray, shape (max_iter + 1,)
The scores at each iteration on the training data. The first entry
is the score of the ensemble before the first iteration. Scores are
computed according to the ``scoring`` parameter. If ``scoring`` is
not 'loss', scores are computed on a subset of at most 10 000
samples. Empty if no early stopping.
validation_score_ : ndarray, shape (max_iter + 1,)
The scores at each iteration on the held-out validation data. The
first entry is the score of the ensemble before the first iteration.
Scores are computed according to the ``scoring`` parameter. Empty if
no early stopping or if ``validation_fraction`` is None.
Examples
--------
>>> # To use this experimental feature, we need to explicitly ask for it:
>>> from sklearn.experimental import enable_hist_gradient_boosting # noqa
>>> from sklearn.ensemble import HistGradientBoostingClassifier
>>> from sklearn.datasets import load_iris
>>> X, y = load_iris(return_X_y=True)
>>> clf = HistGradientBoostingClassifier().fit(X, y)
>>> clf.score(X, y)
1.0
"""