태그 보관물: HistGradientBoostingClassifier

Scikit-Learn 0.21.1 Release

사이킷런 0.21.1 버전이 릴리스 되었습니다! 며칠 전에 릴리스된 0.21 버전의 버그를 수정했다고 하네요. 이제 HistGradientBoostingClassifier, HistGradientBoostingRegressor 클래스가 제대로 보입니다. 🙂

0.21.1 버전은 pip로 설치할 수 있습니다. conda 패키지는 아직 입니다.

$ pip install scikit-learn

Scikit-Learn 0.21.0 Release

사이킷런 0.21 버전이 릴리스 되었습니다! RC 버전에서 언급되었던 히스토그램 기반 부스팅 알고리즘인 HistGradientBoostingClassifier, OPTICS 클러스터링 알고리즘, 누락된 값을 예측하여 채울 때 사용할 수 있는 IterativeImputer, NeighborhoodComponentsAnalysis 가 추가되었습니다.

0.21 버전은 pip로 설치할 수 있습니다. conda 패키지는 하루 이틀 걸릴 것 같네요.

$ pip install scikit-learn

이 중에 HistGradientBoostingClassifier와 IterativeImputer는 실험적인 기능이라 기본으로 활성화되어 있지 않습니다. 다음처럼 sklearn.experimental 모듈 아래를 참조해 주어야 합니다.

>>> from sklearn.experimental import enable_hist_gradient_boosting
>>> from sklearn.ensemble import HistGradientBoostingClassifier
>>> from sklearn.experimental import enable_iterative_imputer
>>> from sklearn.impute import IterativeImputer

무슨 일인지 HistGradientBoostingClassifier 문서가 생성되지 않았네요. 급한대로 소스 코드에서 긁어 올립니다. 🙂

"""Histogram-based Gradient Boosting Classification Tree.

This estimator is much faster than
:class:`GradientBoostingClassifier<sklearn.ensemble.GradientBoostingClassifier>`
for big datasets (n_samples >= 10 000). The input data ``X`` is pre-binned
into integer-valued bins, which considerably reduces the number of
splitting points to consider, and allows the algorithm to leverage
integer-based data structures. For small sample sizes,
:class:`GradientBoostingClassifier<sklearn.ensemble.GradientBoostingClassifier>`
might be preferred since binning may lead to split points that are too
approximate in this setting.

This implementation is inspired by
`LightGBM <https://github.com/Microsoft/LightGBM>`_.

.. note::

  This estimator is still **experimental** for now: the predictions
  and the API might change without any deprecation cycle. To use it,
  you need to explicitly import ``enable_hist_gradient_boosting``::

    >>> # explicitly require this experimental feature
    >>> from sklearn.experimental import enable_hist_gradient_boosting  # noqa
    >>> # now you can import normally from ensemble
    >>> from sklearn.ensemble import HistGradientBoostingClassifier

Parameters
----------
loss : {'auto', 'binary_crossentropy', 'categorical_crossentropy'}, \
        optional (default='auto')
    The loss function to use in the boosting process. 'binary_crossentropy'
    (also known as logistic loss) is used for binary classification and
    generalizes to 'categorical_crossentropy' for multiclass
    classification. 'auto' will automatically choose either loss depending
    on the nature of the problem.
learning_rate : float, optional (default=1)
    The learning rate, also known as *shrinkage*. This is used as a
    multiplicative factor for the leaves values. Use ``1`` for no
    shrinkage.
max_iter : int, optional (default=100)
    The maximum number of iterations of the boosting process, i.e. the
    maximum number of trees for binary classification. For multiclass
    classification, `n_classes` trees per iteration are built.
max_leaf_nodes : int or None, optional (default=31)
    The maximum number of leaves for each tree. Must be strictly greater
    than 1. If None, there is no maximum limit.
max_depth : int or None, optional (default=None)
    The maximum depth of each tree. The depth of a tree is the number of
    nodes to go from the root to the deepest leaf. Must be strictly greater
    than 1. Depth isn't constrained by default.
min_samples_leaf : int, optional (default=20)
    The minimum number of samples per leaf. For small datasets with less
    than a few hundred samples, it is recommended to lower this value
    since only very shallow trees would be built.
l2_regularization : float, optional (default=0)
    The L2 regularization parameter. Use 0 for no regularization.
max_bins : int, optional (default=256)
    The maximum number of bins to use. Before training, each feature of
    the input array ``X`` is binned into at most ``max_bins`` bins, which
    allows for a much faster training stage. Features with a small
    number of unique values may use less than ``max_bins`` bins. Must be no
    larger than 256.
scoring : str or callable or None, optional (default=None)
    Scoring parameter to use for early stopping. It can be a single
    string (see :ref:`scoring_parameter`) or a callable (see
    :ref:`scoring`). If None, the estimator's default scorer
    is used. If ``scoring='loss'``, early stopping is checked
    w.r.t the loss value. Only used if ``n_iter_no_change`` is not None.
validation_fraction : int or float or None, optional (default=0.1)
    Proportion (or absolute size) of training data to set aside as
    validation data for early stopping. If None, early stopping is done on
    the training data.
n_iter_no_change : int or None, optional (default=None)
    Used to determine when to "early stop". The fitting process is
    stopped when none of the last ``n_iter_no_change`` scores are better
    than the ``n_iter_no_change - 1``th-to-last one, up to some
    tolerance. If None or 0, no early-stopping is done.
tol : float or None, optional (default=1e-7)
    The absolute tolerance to use when comparing scores. The higher the
    tolerance, the more likely we are to early stop: higher tolerance
    means that it will be harder for subsequent iterations to be
    considered an improvement upon the reference score.
verbose: int, optional (default=0)
    The verbosity level. If not zero, print some information about the
    fitting process.
random_state : int, np.random.RandomStateInstance or None, \
    optional (default=None)
    Pseudo-random number generator to control the subsampling in the
    binning process, and the train/validation data split if early stopping
    is enabled. See :term:`random_state`.

Attributes
----------
n_iter_ : int
    The number of estimators as selected by early stopping (if
    n_iter_no_change is not None). Otherwise it corresponds to max_iter.
n_trees_per_iteration_ : int
    The number of tree that are built at each iteration. This is equal to 1
    for binary classification, and to ``n_classes`` for multiclass
    classification.
train_score_ : ndarray, shape (max_iter + 1,)
    The scores at each iteration on the training data. The first entry
    is the score of the ensemble before the first iteration. Scores are
    computed according to the ``scoring`` parameter. If ``scoring`` is
    not 'loss', scores are computed on a subset of at most 10 000
    samples. Empty if no early stopping.
validation_score_ : ndarray, shape (max_iter + 1,)
    The scores at each iteration on the held-out validation data. The
    first entry is the score of the ensemble before the first iteration.
    Scores are computed according to the ``scoring`` parameter. Empty if
    no early stopping or if ``validation_fraction`` is None.

Examples
--------
>>> # To use this experimental feature, we need to explicitly ask for it:
>>> from sklearn.experimental import enable_hist_gradient_boosting  # noqa
>>> from sklearn.ensemble import HistGradientBoostingRegressor
>>> from sklearn.datasets import load_iris
>>> X, y = load_iris(return_X_y=True)
>>> clf = HistGradientBoostingClassifier().fit(X, y)
>>> clf.score(X, y)
1.0
"""

Scikit-Learn 0.21 RC Release

사이킷런 0.21 RC 버전이 릴리스 되었습니다. 일전에 소개해 드렸던 히스토그램 기반의 부스팅 트리 알고리즘인 HistGradientBoostingClassifierHistGradientBoostingRegressor가 가장 주목을 받고 있습니다. 샘플이 만 개 이상이면 기존의 그래디언트 부스팅보다 훨씬 빠릅니다. 이 클래스들은 마이크로소프트의 LightGBM에 영향을 받아 만들어진 pygbm의 사이킷런 포팅입니다. 히스토그램 기반 부스팅 트리는 캐글에서 가장 많이 사용하는 알고리즘 중 하나입니다.

그외에도 많은 기능이 추가되었습니다. 눈에 띠는 것은 다음과 같습니다.

  • OPTICS 클러스터링 알고리즘이 추가되었습니다. DBSCAN와 유사하지만 매개변수 설정이 쉽고 대용량 데이터셋에도 잘 동작합니다.
  • 데이터셋에서 한 특성을 타깃으로 정하고 나머지 특성을 사용하여 누락된 값을 예측하는 IterativeImputer가 추가되었습니다. 타깃 열을 바꾸어 가며 반복합니다. 모델링에 사용하는 기본 추정기는 BayesianRidge 클래스입니다.
  • 샘플 간의 거리 지표를 학습(metric learning)하여 차원 축소로도 활용할 수 있는 NeighborhoodComponentsAnalysis(NCA)가 추가되었습니다.

0.21 버전의 자세한 변경 사항은 What’s new 페이지를 참고하세요.

0.21 RC 버전은 다음과 같이 설치할 수 있습니다.

pip install scikit-learn==0.21rc2