scikit-learn 0.21 has been released! As mentioned when the RC came out, this release adds HistGradientBoostingClassifier, a histogram-based gradient boosting algorithm; the OPTICS clustering algorithm; IterativeImputer, which fills in missing values by predicting them; and NeighborhoodComponentsAnalysis.
You can install 0.21 with pip right away; the conda package will probably take another day or two.
$ pip install --upgrade scikit-learn
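Once the upgrade is done, the non-experimental additions can be imported right away. Here is a minimal sketch of OPTICS and NeighborhoodComponentsAnalysis; the iris data and the parameter values are my own choices for illustration, not recommendations from the release notes.

from sklearn.cluster import OPTICS
from sklearn.datasets import load_iris
from sklearn.neighbors import NeighborhoodComponentsAnalysis

X, y = load_iris(return_X_y=True)

# OPTICS: density-based clustering that, unlike DBSCAN, does not need
# a fixed eps; noise points get the label -1
optics = OPTICS(min_samples=10).fit(X)
print(optics.labels_[:10])

# NCA: a supervised linear transformation that improves nearest-neighbor
# classification; n_components=2 also makes it handy for visualization
nca = NeighborhoodComponentsAnalysis(n_components=2, random_state=42)
X_nca = nca.fit_transform(X, y)
print(X_nca.shape)   # (150, 2)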
Of these, HistGradientBoostingClassifier and IterativeImputer are experimental features and are not enabled by default. You first have to import the corresponding enabling flags from the sklearn.experimental module, as shown below.
>>> from sklearn.experimental import enable_hist_gradient_boosting
>>> from sklearn.ensemble import HistGradientBoostingClassifier
>>> from sklearn.experimental import enable_iterative_imputer
>>> from sklearn.impute import IterativeImputer
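IterativeImputer models each feature that has missing values as a function of the other features, in round-robin fashion, and uses the fitted regressor (BayesianRidge by default) to predict the missing entries. A minimal sketch; the toy array with np.nan values is my own illustration:

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0],
              [3.0, 6.0],
              [4.0, 8.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# each missing entry is predicted from the remaining features
imputer = IterativeImputer(max_iter=10, random_state=0)
print(imputer.fit_transform(X))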
For some reason the HistGradientBoostingClassifier API documentation was not generated, so as a stopgap I'm pasting the docstring straight from the source code. 🙂
"""Histogram-based Gradient Boosting Classification Tree. This estimator is much faster than :class:`GradientBoostingClassifier<sklearn.ensemble.GradientBoostingClassifier>` for big datasets (n_samples >= 10 000). The input data ``X`` is pre-binned into integer-valued bins, which considerably reduces the number of splitting points to consider, and allows the algorithm to leverage integer-based data structures. For small sample sizes, :class:`GradientBoostingClassifier<sklearn.ensemble.GradientBoostingClassifier>` might be preferred since binning may lead to split points that are too approximate in this setting. This implementation is inspired by `LightGBM <https://github.com/Microsoft/LightGBM>`_. .. note:: This estimator is still **experimental** for now: the predictions and the API might change without any deprecation cycle. To use it, you need to explicitly import ``enable_hist_gradient_boosting``:: >>> # explicitly require this experimental feature >>> from sklearn.experimental import enable_hist_gradient_boosting # noqa >>> # now you can import normally from ensemble >>> from sklearn.ensemble import HistGradientBoostingClassifier Parameters ---------- loss : {'auto', 'binary_crossentropy', 'categorical_crossentropy'}, \ optional (default='auto') The loss function to use in the boosting process. 'binary_crossentropy' (also known as logistic loss) is used for binary classification and generalizes to 'categorical_crossentropy' for multiclass classification. 'auto' will automatically choose either loss depending on the nature of the problem. learning_rate : float, optional (default=1) The learning rate, also known as *shrinkage*. This is used as a multiplicative factor for the leaves values. Use ``1`` for no shrinkage. max_iter : int, optional (default=100) The maximum number of iterations of the boosting process, i.e. the maximum number of trees for binary classification. For multiclass classification, `n_classes` trees per iteration are built. max_leaf_nodes : int or None, optional (default=31) The maximum number of leaves for each tree. Must be strictly greater than 1. If None, there is no maximum limit. max_depth : int or None, optional (default=None) The maximum depth of each tree. The depth of a tree is the number of nodes to go from the root to the deepest leaf. Must be strictly greater than 1. Depth isn't constrained by default. min_samples_leaf : int, optional (default=20) The minimum number of samples per leaf. For small datasets with less than a few hundred samples, it is recommended to lower this value since only very shallow trees would be built. l2_regularization : float, optional (default=0) The L2 regularization parameter. Use 0 for no regularization. max_bins : int, optional (default=256) The maximum number of bins to use. Before training, each feature of the input array ``X`` is binned into at most ``max_bins`` bins, which allows for a much faster training stage. Features with a small number of unique values may use less than ``max_bins`` bins. Must be no larger than 256. scoring : str or callable or None, optional (default=None) Scoring parameter to use for early stopping. It can be a single string (see :ref:`scoring_parameter`) or a callable (see :ref:`scoring`). If None, the estimator's default scorer is used. If ``scoring='loss'``, early stopping is checked w.r.t the loss value. Only used if ``n_iter_no_change`` is not None. 
validation_fraction : int or float or None, optional (default=0.1) Proportion (or absolute size) of training data to set aside as validation data for early stopping. If None, early stopping is done on the training data. n_iter_no_change : int or None, optional (default=None) Used to determine when to "early stop". The fitting process is stopped when none of the last ``n_iter_no_change`` scores are better than the ``n_iter_no_change - 1``th-to-last one, up to some tolerance. If None or 0, no early-stopping is done. tol : float or None, optional (default=1e-7) The absolute tolerance to use when comparing scores. The higher the tolerance, the more likely we are to early stop: higher tolerance means that it will be harder for subsequent iterations to be considered an improvement upon the reference score. verbose: int, optional (default=0) The verbosity level. If not zero, print some information about the fitting process. random_state : int, np.random.RandomStateInstance or None, \ optional (default=None) Pseudo-random number generator to control the subsampling in the binning process, and the train/validation data split if early stopping is enabled. See :term:`random_state`. Attributes ---------- n_iter_ : int The number of estimators as selected by early stopping (if n_iter_no_change is not None). Otherwise it corresponds to max_iter. n_trees_per_iteration_ : int The number of tree that are built at each iteration. This is equal to 1 for binary classification, and to ``n_classes`` for multiclass classification. train_score_ : ndarray, shape (max_iter + 1,) The scores at each iteration on the training data. The first entry is the score of the ensemble before the first iteration. Scores are computed according to the ``scoring`` parameter. If ``scoring`` is not 'loss', scores are computed on a subset of at most 10 000 samples. Empty if no early stopping. validation_score_ : ndarray, shape (max_iter + 1,) The scores at each iteration on the held-out validation data. The first entry is the score of the ensemble before the first iteration. Scores are computed according to the ``scoring`` parameter. Empty if no early stopping or if ``validation_fraction`` is None. Examples -------- >>> # To use this experimental feature, we need to explicitly ask for it: >>> from sklearn.experimental import enable_hist_gradient_boosting # noqa >>> from sklearn.ensemble import HistGradientBoostingRegressor >>> from sklearn.datasets import load_iris >>> X, y = load_iris(return_X_y=True) >>> clf = HistGradientBoostingClassifier().fit(X, y) >>> clf.score(X, y) 1.0 """
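To put the docstring to work, here is a quick sketch that exercises the early-stopping parameters described above. The synthetic dataset and the specific parameter values are my own choices for illustration:

from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# early stopping: hold out validation_fraction of the training data and
# stop when the loss has not improved for n_iter_no_change iterations
clf = HistGradientBoostingClassifier(max_iter=500,
                                     n_iter_no_change=5,
                                     validation_fraction=0.1,
                                     scoring='loss',
                                     random_state=0)
clf.fit(X_train, y_train)
print(clf.n_iter_)                   # number of trees actually built
print(clf.score(X_test, y_test))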