Describe the bug
I have a data set where I have tried to optimise the hyperparameters with FLAML, and it seems that the model keeps getting worse the longer I let it run. Here is a simple example of the code I have for the model I am trying to optimise:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.utils.class_weight import compute_sample_weight
from sklearn.metrics import f1_score, confusion_matrix, classification_report, precision_score, recall_score
from flaml import AutoML
import numpy as np
import joblib


def create_and_train_pipeline(X_train, y_train, X_test, y_test,
                              numerical_features, categorical_features, time_budget=60):
    """
    Creates and trains a pipeline without requiring a custom wrapper class.
    """
    # First, create and fit the preprocessor
    numeric_transformer = Pipeline(steps=[
        ('scaler', StandardScaler())
    ])
    categorical_transformer = Pipeline(steps=[
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ])
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numerical_features),
            ('cat', categorical_transformer, categorical_features)
        ],
        remainder='drop',
        sparse_threshold=0
    )

    # Fit the preprocessor first
    X_train_transformed = preprocessor.fit_transform(X_train)

    # Train AutoML on the transformed data
    automl = AutoML()
    settings = {
        "time_budget": time_budget,
        "task": "classification",
        "estimator_list": ['lgbm', 'rf'],
        "eval_method": "cv",
        "metric": "f1",
        "n_splits": 5,
        "split_type": "stratified"
    }
    automl.fit(X_train_transformed, y_train, **settings)

    # Create the final pipeline with the best model
    final_pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', automl.model.estimator)  # Use the best model directly
    ])

    # Print training results
    print("Best ML model:")
    print(automl.model.estimator)
    print("\nBest hyperparameter configuration:")
    print(automl.best_config)
    print("\nBest score on validation data: {:.4f}".format(automl.best_loss))

    # Generate and print test metrics
    y_pred = final_pipeline.predict(X_test)
    print("\nTraining Set Metrics:")
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred))
    print("\nConfusion Matrix:")
    print(confusion_matrix(y_test, y_pred))

    # Save the pipeline
    joblib.dump(final_pipeline, 'full_prediction_pipeline.joblib')
    return final_pipeline, automl


if __name__ == "__main__":
    categorical_features = ['created_on', 'dex_id', 'price_confidence']
    numerical_features = [col for col in X_train.columns if col not in categorical_features]
    pipeline, automl = create_and_train_pipeline(
        X_train=X_train,
        y_train=y_train,
        X_test=X_test,
        y_test=y_test,
        numerical_features=numerical_features,
        categorical_features=categorical_features,
        time_budget=35
    )
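(As an aside, compute_sample_weight is imported above because weighting the minority class is something I have been looking at as well. The snippet below is only a rough sketch of what I mean; it assumes automl.fit accepts a sample_weight keyword that it forwards to the underlying estimators, which I have not verified.)

# Rough sketch (not part of the run above): weight each training row inversely
# to its class frequency, then hand the weights to FLAML during tuning.
sample_weights = compute_sample_weight(class_weight='balanced', y=y_train)
automl.fit(
    X_train_transformed,
    y_train,
    sample_weight=sample_weights,  # assumption: AutoML.fit accepts and forwards sample_weight
    **settings
)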
This gives a minority-class F1 of 0.37 and a majority-class F1 of 0.96 with a budget of 35 seconds:
Best score on validation data: 0.5886
Training Set Metrics:
Classification Report:
              precision    recall  f1-score   support

           0       0.97      0.95      0.96       930
           1       0.32      0.45      0.37        49

    accuracy                           0.92       979
   macro avg       0.64      0.70      0.67       979
weighted avg       0.94      0.92      0.93       979
Confusion Matrix:
[[883 47]
[ 27 22]]
If I increase the budget to 60 seconds I get a minority-class F1 of 0.34 and a majority-class F1 of 0.96:
Best score on validation data: 0.5815
Training Set Metrics:
Classification Report:
              precision    recall  f1-score   support

           0       0.97      0.95      0.96       930
           1       0.30      0.39      0.34        49

    accuracy                           0.92       979
   macro avg       0.63      0.67      0.65       979
weighted avg       0.93      0.92      0.93       979
Confusion Matrix:
[[885 45]
[ 30 19]]
And after 120 seconds, a minority-class F1 of 0.33 and a majority-class F1 of 0.96:
Training Set Metrics:
Classification Report:
              precision    recall  f1-score   support

           0       0.97      0.95      0.96       930
           1       0.29      0.39      0.33        49

    accuracy                           0.92       979
   macro avg       0.63      0.67      0.65       979
weighted avg       0.93      0.92      0.93       979
Confusion Matrix:
[[884 46]
[ 30 19]]
I am wondering why this is happening. The loss reported in the logs keeps decreasing, yet the resulting model is worse. The same thing happens even when I define my own custom metric (and negate its output, of course, since FLAML minimises): as the negative value is minimised (its absolute value grows), the final confusion matrix still gets worse. What am I doing wrong here? Thanks a lot.
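For reference, the custom metric I mention is along these lines (a minimal sketch following FLAML's documented custom-metric interface; minority_f1_metric is just an illustrative name and the real function I used differs only in details):

from sklearn.metrics import f1_score

def minority_f1_metric(X_val, y_val, estimator, labels,
                       X_train, y_train, weight_val=None, weight_train=None,
                       *args, **kwargs):
    # F1 on the minority class (label 1) of the validation fold
    y_pred = estimator.predict(X_val)
    minority_f1 = f1_score(y_val, y_pred, pos_label=1)
    # FLAML minimises the first return value, so the score is negated;
    # the dict holds extra values that get logged alongside it
    return -minority_f1, {"minority_f1": minority_f1}

# then passed in via the settings dict: settings["metric"] = minority_f1_metric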
Steps to reproduce
No response
Model Used
No response
Expected Behavior
No response
Screenshots and logs
No response
Additional Information
No response