Describe the bug
Hi @thinkall,
I think I may have found an issue in FLAML, although it's possible it was a deliberate choice by the developers. If we use the holdout strategy for classification tasks, we find that:
len(input_data) < len(training_data) + len(test_data).
This occurs even when I set auto_augment=False, so up-sampling of the data is not the issue here.
Steps to reproduce
If I run a classification task against the Iris dataset, my input dataset has 150 rows.
If I then analyse the AutoML state afterwards, I can see that we have 135 rows in automl._state.X_train and 18 rows in automl._state.X_val, making 153 rows in total - 3 more rows than we started with. The code to reproduce this is pasted below:
from flaml import AutoML
from sklearn import datasets
import numpy as np

dic_data = datasets.load_iris(as_frame=True)  # sklearn Bunch containing pandas objects
iris_data = dic_data["frame"]  # pandas DataFrame: data + target
rng = np.random.default_rng(42)
iris_data["cluster"] = rng.integers(low=0, high=5, size=iris_data.shape[0])
print(iris_data["cluster"])
print('shape at start', iris_data.shape)

automl = AutoML()
automl_settings = {
    "max_iter": 5,
    "metric": 'accuracy',
    "task": 'classification',
    "log_file_name": "holdout_test.log",
    "log_type": "all",
    "estimator_list": ['lgbm'],
    "eval_method": "holdout",
    "split_type": "stratified",
    "keep_search_state": True,
    "retrain_full": True,
    "auto_augment": False,
}

x_train = iris_data[
    ["sepal length (cm)", "sepal width (cm)", "petal length (cm)", "petal width (cm)"]
].to_numpy()
y_train = iris_data['target']
automl.fit(x_train, y_train, **automl_settings)

print(len(automl._state.X_train), len(automl._state.X_train_all), len(automl._state.X_val))
print(len(automl._state.y_train), len(automl._state.y_train_all), len(automl._state.y_val))
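For reference, here is a quick (hypothetical) follow-up check that the extra rows really are shared between the two splits. It relies on the same private automl._state attributes as above, and since Iris contains a couple of naturally duplicated rows the count is an upper bound on the overlap introduced by the split:

import numpy as np

X_tr = np.asarray(automl._state.X_train)
X_va = np.asarray(automl._state.X_val)

# Count validation rows that also appear verbatim in the training split.
overlap = sum(
    any(np.array_equal(val_row, train_row) for train_row in X_tr)
    for val_row in X_va
)
print(f"{overlap} of {len(X_va)} validation rows also appear in the training split")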
My colleague @drwillcharles has identified the cause of this issue, which is in the prepare_data method in flaml/automl/task/generic_task.py:
X_train, X_val, y_train, y_val = self._train_test_split(
    state, X_rest, y_rest, first, rest, split_ratio, stratify
)
X_train = concat(X_first, X_train)
y_train = concat(label_set, y_train) if data_is_df else np.concatenate([label_set, y_train])
X_val = concat(X_first, X_val)
y_val = concat(label_set, y_val) if data_is_df else np.concatenate([label_set, y_val])
Here the first row of each class is extracted from the original training dataset (into X_first), and once _train_test_split has been used to split the remaining data, these rows are added back to both the training and validation sets.
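In isolation, the effect looks like the following simplified sketch (plain pandas/sklearn, not the actual FLAML code; the 10% split ratio is assumed to match the default holdout ratio):

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

df = load_iris(as_frame=True)["frame"]

# Take the first row of each class out, as prepare_data does with X_first.
first = df.groupby("target").head(1)   # 3 rows, one per class
rest = df.drop(first.index)            # 147 rows

train_rest, val_rest = train_test_split(
    rest, test_size=0.1, stratify=rest["target"], random_state=42
)

# The held-out rows are then added back to BOTH splits.
train = pd.concat([first, train_rest])  # 132 + 3 = 135 rows
val = pd.concat([first, val_rest])      # 15 + 3 = 18 rows

print(len(df), len(train) + len(val))   # 150 vs 153

This reproduces the 135/18 split sizes reported above.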
I'm not sure whether this was an error or a deliberate choice by the original developers.
The advantage of this code is that it guarantees both the training and validation sets contain at least one instance of every class. The disadvantage is that the training and validation data overlap, which leaks training rows into validation and biases the models' evaluation.
Possible Solution
Could I please ask your thoughts on this? Personally, if this code is to be kept, I'd rather it were applied only when it is actually required, not in every case.
Perhaps we could run _train_test_split on the entire dataset (including X_first) and then duplicate X_first only when necessary (i.e. when the training or validation set is missing one or more of the classes)?
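A rough sketch of what I have in mind (hypothetical, written against plain pandas/sklearn rather than the FLAML internals):

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

def top_up_missing_classes(split, full, label="target"):
    # Duplicate one representative row per class into `split`,
    # but only for classes that are actually absent from it.
    missing = set(full[label].unique()) - set(split[label].unique())
    if not missing:
        return split
    extras = full[full[label].isin(missing)].groupby(label).head(1)
    return pd.concat([split, extras])

df = load_iris(as_frame=True)["frame"]
train, val = train_test_split(
    df, test_size=0.1, stratify=df["target"], random_state=42
)
train = top_up_missing_classes(train, df)
val = top_up_missing_classes(val, df)

# With a stratified split on Iris no class is missing, so nothing is
# duplicated and the row counts add up exactly.
print(len(df), len(train) + len(val))  # 150 150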
Please let me know if I'm misunderstanding anything - I welcome your thoughts.
Thanks!
Model Used
Originally observed with Random Forest, but this error has been present for all models tested (the reproduction above uses 'lgbm').
Expected Behavior
No response
Screenshots and logs
No response
Additional Information
Python 3.10.3
FLAML 2.3.2