Fit, validation, and test data

From Rave Documentation

Introduction

When creating a surrogate model, the rows in your data set are divided into three groups depending on how they will affect the model. Each type of surrogate model uses the fit and validation data slightly differently, but the main ideas are:

  • Fit Data - The fit data rows are directly used to create the model.
  • Validation Data - Some surrogate models use the validation data to test a surrogate model as it is being created (e.g. Neural Networks do this). While the model is not directly fit to the validation data, the validation data does indirectly influence the model creation.
  • Test Data - The test data rows are not used at all during the model creation process. You can therefore use this data to test how well the model generalizes to data it has never seen before.

Note: Surrogate models that do not use validation data will treat any rows assigned to validation data as test data instead.
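
The rule in this note can be sketched in a few lines of Python. This is only an illustration, not Rave's own code: the function name and the uses_validation flag are hypothetical, and the codes 0, 1, and 2 are assumed to be the default fit-data column values.

```python
# Assumed default codes: 0 = fit, 1 = validation, 2 = test.
FIT, VALIDATION, TEST = 0, 1, 2

def effective_assignments(fit_data_column, uses_validation):
    """Return the set codes a model would effectively see.

    uses_validation is a hypothetical flag standing in for "this type of
    surrogate model makes use of validation data".
    """
    if uses_validation:
        return list(fit_data_column)
    # Models without a validation step treat validation rows as test rows.
    return [TEST if code == VALIDATION else code for code in fit_data_column]

print(effective_assignments([0, 1, 2, 1], uses_validation=False))
```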

The Fit-Data Column

Rows are assigned to the fit, validation, and test sets by using a "fit-data column". This is a column in the data set whose values indicate which set each row belongs to. By default, a value of 0 indicates fit data, 1 indicates validation data, and 2 indicates test data. You can create a fit-data column by clicking the "Make New Fit-Data Column" button in the Create New Model GUI. A new column is appended to the corresponding data set, so you can reuse the same set assignments when making multiple models.
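
If you prepare a data set outside the GUI, a fit-data column can also be generated by hand. The following is a minimal sketch in Python, assuming the default codes (0, 1, 2) and a random assignment of rows. The function name, the fractions, and the seed are hypothetical choices made for illustration and do not describe what the 'Make New Fit-Data Column' button does internally.

```python
import random

# Assumed default codes: 0 = fit, 1 = validation, 2 = test.
FIT, VALIDATION, TEST = 0, 1, 2

def make_fit_data_column(n_rows, validation_fraction=0.2, test_fraction=0.2, seed=0):
    """Return a list of n_rows set codes, randomly assigned to rows."""
    if n_rows < 1:
        raise ValueError("n_rows must be at least 1")
    if validation_fraction < 0 or test_fraction < 0:
        raise ValueError("fractions must be non-negative")
    if validation_fraction + test_fraction >= 1.0:
        raise ValueError("some rows must be left over for the fit set")

    n_validation = int(n_rows * validation_fraction)
    n_test = int(n_rows * test_fraction)
    n_fit = n_rows - n_validation - n_test  # always at least 1 here

    column = [FIT] * n_fit + [VALIDATION] * n_validation + [TEST] * n_test
    # A seeded shuffle makes the assignment reproducible across runs.
    random.Random(seed).shuffle(column)
    return column

column = make_fit_data_column(20, validation_fraction=0.25, test_fraction=0.25)
print(column.count(FIT), column.count(VALIDATION), column.count(TEST))
```

The resulting list can be stored as one extra column of the data set, which keeps the set assignments alongside the rows they describe.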

In order to create a model, you must have some data assigned to the Fit Data set, but the Validation and Test sets may be empty. Leaving the Test set empty is generally not recommended, however, because without a set of independent Test Data it is difficult to determine if a model has been "over-fit". (An over-fit model is extremely accurate when reproducing the exact data used to create it, i.e. the fit data, but performs poorly on any other data.)
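
One common way to look for over-fitting is to compare the model's error on the fit data with its error on the test data. The sketch below computes a root-mean-square error for each set. It assumes the default codes (0, 1, 2) and that you already have the model's predictions as a plain list; the function name is hypothetical and is not part of Rave.

```python
import math

def rmse_by_set(actual, predicted, fit_data_column):
    """Return the RMSE for each set, or None for a set with no rows."""
    labels = {0: "fit", 1: "validation", 2: "test"}  # assumed default codes
    squared_errors = {name: [] for name in labels.values()}
    for a, p, code in zip(actual, predicted, fit_data_column):
        squared_errors[labels[code]].append((a - p) ** 2)
    return {
        name: math.sqrt(sum(values) / len(values)) if values else None
        for name, values in squared_errors.items()
    }

# Made-up numbers for illustration only.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.5, 5.0]
column = [0, 0, 2, 2]
errors = rmse_by_set(actual, predicted, column)
print(errors)
```

A test error that is much larger than the fit error suggests the model may be over-fit. How large a gap is acceptable depends on the problem.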