
Scikit-Learn 0.21.0 Release

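Scikit-learn 0.21 has been released! As mentioned when the RC came out, this version adds HistGradientBoostingClassifier, a histogram-based boosting algorithm; the OPTICS clustering algorithm; IterativeImputer, which fills in missing values by predicting them; and NeighborhoodComponentsAnalysis.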

Version 0.21 can be installed with pip. The conda package will probably take a day or two.

$ pip install --upgrade scikit-learn

Of these, HistGradientBoostingClassifier and IterativeImputer are experimental features, so they are not enabled by default. You first have to import the matching enabling module from sklearn.experimental, like this:

>>> from sklearn.experimental import enable_hist_gradient_boosting
>>> from sklearn.ensemble import HistGradientBoostingClassifier
>>> from sklearn.experimental import enable_iterative_imputer
>>> from sklearn.impute import IterativeImputer
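
To make the imputer a little more concrete, here is a minimal sketch of filling in missing values with IterativeImputer. The tiny array and the `max_iter` and `random_state` settings are made up for illustration; they are not from the release notes.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.impute import IterativeImputer

# Toy data (an assumption for this sketch): the second column is roughly
# twice the first, and two entries are missing.
X = np.array([
    [1.0, 2.0],
    [3.0, 6.0],
    [4.0, 8.0],
    [np.nan, 3.0],
    [7.0, np.nan],
])

# Each feature with missing values is modeled from the other features,
# and the predictions are used to fill the gaps.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)

print(X_filled)
```

Observed entries are left as they are; only the NaN cells are replaced with predicted values.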

For some reason the documentation for HistGradientBoostingClassifier was not generated. As a stopgap, here is the docstring copied straight from the source code. 🙂

"""Histogram-based Gradient Boosting Classification Tree.

This estimator is much faster than
:class:`GradientBoostingClassifier<sklearn.ensemble.GradientBoostingClassifier>`
for big datasets (n_samples >= 10 000). The input data ``X`` is pre-binned
into integer-valued bins, which considerably reduces the number of
splitting points to consider, and allows the algorithm to leverage
integer-based data structures. For small sample sizes,
:class:`GradientBoostingClassifier<sklearn.ensemble.GradientBoostingClassifier>`
might be preferred since binning may lead to split points that are too
approximate in this setting.

This implementation is inspired by
`LightGBM <https://github.com/Microsoft/LightGBM>`_.

.. note::

  This estimator is still **experimental** for now: the predictions
  and the API might change without any deprecation cycle. To use it,
  you need to explicitly import ``enable_hist_gradient_boosting``::

    >>> # explicitly require this experimental feature
    >>> from sklearn.experimental import enable_hist_gradient_boosting  # noqa
    >>> # now you can import normally from ensemble
    >>> from sklearn.ensemble import HistGradientBoostingClassifier

Parameters
----------
loss : {'auto', 'binary_crossentropy', 'categorical_crossentropy'}, \
        optional (default='auto')
    The loss function to use in the boosting process. 'binary_crossentropy'
    (also known as logistic loss) is used for binary classification and
    generalizes to 'categorical_crossentropy' for multiclass
    classification. 'auto' will automatically choose either loss depending
    on the nature of the problem.
learning_rate : float, optional (default=0.1)
    The learning rate, also known as *shrinkage*. This is used as a
    multiplicative factor for the leaves values. Use ``1`` for no
    shrinkage.
max_iter : int, optional (default=100)
    The maximum number of iterations of the boosting process, i.e. the
    maximum number of trees for binary classification. For multiclass
    classification, `n_classes` trees per iteration are built.
max_leaf_nodes : int or None, optional (default=31)
    The maximum number of leaves for each tree. Must be strictly greater
    than 1. If None, there is no maximum limit.
max_depth : int or None, optional (default=None)
    The maximum depth of each tree. The depth of a tree is the number of
    nodes to go from the root to the deepest leaf. Must be strictly greater
    than 1. Depth isn't constrained by default.
min_samples_leaf : int, optional (default=20)
    The minimum number of samples per leaf. For small datasets with less
    than a few hundred samples, it is recommended to lower this value
    since only very shallow trees would be built.
l2_regularization : float, optional (default=0)
    The L2 regularization parameter. Use 0 for no regularization.
max_bins : int, optional (default=256)
    The maximum number of bins to use. Before training, each feature of
    the input array ``X`` is binned into at most ``max_bins`` bins, which
    allows for a much faster training stage. Features with a small
    number of unique values may use less than ``max_bins`` bins. Must be no
    larger than 256.
scoring : str or callable or None, optional (default=None)
    Scoring parameter to use for early stopping. It can be a single
    string (see :ref:`scoring_parameter`) or a callable (see
    :ref:`scoring`). If None, the estimator's default scorer
    is used. If ``scoring='loss'``, early stopping is checked
    w.r.t the loss value. Only used if ``n_iter_no_change`` is not None.
validation_fraction : int or float or None, optional (default=0.1)
    Proportion (or absolute size) of training data to set aside as
    validation data for early stopping. If None, early stopping is done on
    the training data.
n_iter_no_change : int or None, optional (default=None)
    Used to determine when to "early stop". The fitting process is
    stopped when none of the last ``n_iter_no_change`` scores are better
    than the ``n_iter_no_change - 1``th-to-last one, up to some
    tolerance. If None or 0, no early-stopping is done.
tol : float or None, optional (default=1e-7)
    The absolute tolerance to use when comparing scores. The higher the
    tolerance, the more likely we are to early stop: higher tolerance
    means that it will be harder for subsequent iterations to be
    considered an improvement upon the reference score.
verbose : int, optional (default=0)
    The verbosity level. If not zero, print some information about the
    fitting process.
random_state : int, np.random.RandomState instance or None, \
    optional (default=None)
    Pseudo-random number generator to control the subsampling in the
    binning process, and the train/validation data split if early stopping
    is enabled. See :term:`random_state`.

Attributes
----------
n_iter_ : int
    The number of estimators as selected by early stopping (if
    n_iter_no_change is not None). Otherwise it corresponds to max_iter.
n_trees_per_iteration_ : int
    The number of trees that are built at each iteration. This is equal to 1
    for binary classification, and to ``n_classes`` for multiclass
    classification.
train_score_ : ndarray, shape (max_iter + 1,)
    The scores at each iteration on the training data. The first entry
    is the score of the ensemble before the first iteration. Scores are
    computed according to the ``scoring`` parameter. If ``scoring`` is
    not 'loss', scores are computed on a subset of at most 10 000
    samples. Empty if no early stopping.
validation_score_ : ndarray, shape (max_iter + 1,)
    The scores at each iteration on the held-out validation data. The
    first entry is the score of the ensemble before the first iteration.
    Scores are computed according to the ``scoring`` parameter. Empty if
    no early stopping or if ``validation_fraction`` is None.

Examples
--------
>>> # To use this experimental feature, we need to explicitly ask for it:
>>> from sklearn.experimental import enable_hist_gradient_boosting  # noqa
>>> from sklearn.ensemble import HistGradientBoostingClassifier
>>> from sklearn.datasets import load_iris
>>> X, y = load_iris(return_X_y=True)
>>> clf = HistGradientBoostingClassifier().fit(X, y)
>>> clf.score(X, y)
1.0
"""

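The early-stopping parameters in the docstring (``n_iter_no_change``, ``validation_fraction``, ``scoring``) are easier to follow with a small example. This is only a sketch: the synthetic dataset and the specific values are my own choices, and on later scikit-learn releases early stopping is additionally governed by an ``early_stopping`` parameter, so the number of iterations you see may differ from 0.21.

```python
try:
    # Required on 0.21; later releases may not need (or may warn about) it.
    from sklearn.experimental import enable_hist_gradient_boosting  # noqa
except ImportError:
    pass
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic data, purely for illustration.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = HistGradientBoostingClassifier(
    max_iter=200,             # upper bound on boosting iterations
    learning_rate=0.1,
    n_iter_no_change=5,       # stop when the last 5 scores bring no improvement
    validation_fraction=0.1,  # hold out 10% of the training data for scoring
    scoring="loss",           # monitor the loss instead of the default scorer
    random_state=0,
)
clf.fit(X_train, y_train)

n_iter = clf.n_iter_                  # iterations actually run (<= max_iter)
accuracy = clf.score(X_test, y_test)  # mean accuracy on the test split
print(n_iter, accuracy)
```

If early stopping kicks in, ``n_iter_`` ends up smaller than ``max_iter``; otherwise it equals ``max_iter``, as the Attributes section describes.
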
Fast Gradient Boosting Tree

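Scikit-learn is set to gain FastGradientBoostingClassifier and FastGradientBoostingRegressor, advanced gradient boosting estimators in the vein of xgboost. Both classes are a scikit-learn port of pygbm, a Python implementation of boosted trees that uses the histogram-binning approach of Microsoft's LightGBM. LightGBM is generally reported to perform slightly better than, or about on par with, xgboost. It would be great if the new scikit-learn classes came close to that level, even if they do not quite match LightGBM. Both classes are slated for scikit-learn 0.21. 🙂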
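
For readers wondering what histogram binning actually means, here is a rough NumPy sketch of the idea: each continuous feature is mapped to a small set of integer bin codes, so the tree builder only has to consider a few hundred candidate split points instead of one per distinct value. The `bin_feature` helper is hypothetical and written just for this post; it is not how pygbm, LightGBM, or scikit-learn implement it.

```python
import numpy as np


def bin_feature(values, max_bins=256):
    """Map a 1-D float feature to integer bin codes (illustrative only)."""
    # Interior quantile cut points. Duplicates collapse when the feature
    # has few distinct values, so fewer than max_bins bins may be used.
    quantiles = np.linspace(0, 100, max_bins + 1)[1:-1]
    thresholds = np.unique(np.percentile(values, quantiles))
    # For each value, find the index of the bin it falls into.
    codes = np.searchsorted(thresholds, values, side="right")
    # With at most 255 thresholds the codes fit in an unsigned byte.
    return codes.astype(np.uint8), thresholds


rng = np.random.RandomState(0)
x = rng.normal(size=10_000)

codes, thresholds = bin_feature(x)
print(np.unique(x).size, "distinct raw values")
print(thresholds.size, "candidate split points after binning")
```

The trade-off is the one the docstring hints at: on large datasets the approximation costs little and saves a lot of work, while on small ones the coarse split points can hurt.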