Data sets
Introduction
A data set is a collection of variables and any corresponding data values.
Important points about data sets:
- Data sets are created in one of three ways: by loading a new file, by duplicating an existing data set, or by generating one from a design of experiments.
- You can have any number of data sets loaded in a single Rave session.
- Each data set is completely independent of the others, and data sets cannot interact: each graph displays data from a single data set, each optimizer acts on a single data set, and so on.
- If a data set contains data, every variable must have the same number of rows of data.
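The equal-row-count rule and the independence of duplicated data sets can be illustrated with a minimal Python sketch. This is not Rave's implementation; the `DataSet` class and its method names are hypothetical and exist only to show the invariant.

```python
class DataSet:
    """Hypothetical container mapping variable names to columns of values."""

    def __init__(self, columns=None):
        self.columns = {}
        for name, values in (columns or {}).items():
            self.add_variable(name, values)

    def row_count(self):
        # Every column has the same length, so any column gives the row count.
        for values in self.columns.values():
            return len(values)
        return 0

    def add_variable(self, name, values):
        values = list(values)
        # Enforce the rule: every variable has the same number of rows.
        if self.columns and len(values) != self.row_count():
            raise ValueError(
                f"'{name}' has {len(values)} rows, expected {self.row_count()}"
            )
        self.columns[name] = values

    def duplicate(self):
        # Copy each column so the duplicate shares no data with the original.
        return DataSet({name: list(v) for name, v in self.columns.items()})
```

Editing a duplicate made this way leaves the original untouched, which mirrors the statement that data sets never interact.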
Variables
A variable is the basic building block of your data sets in Rave. Variables are defined by the following attributes:
Variable Name
Each variable has a name. When you load a data set from a file, the variable names are taken from the first row of the file. Otherwise, Rave will ask you to name variables as they are created.
The following rules apply to variable names:
- Variable names cannot contain a tab character
- Variable names cannot contain a comma if your data file is comma delimited. For tab-delimited or xls files, commas are allowed.
- Variable names cannot contain the double quotes character: "
- Spaces are allowed; any leading/trailing spaces will be removed.
If your data set uses user-supplied functions that Rave must parse (i.e. plain text functions), the following rules also apply:
- Variable names must begin with a letter
- Only letters, numbers, spaces, and _ are allowable characters (no other symbols are allowed)
- Spaces are allowed, as Rave will internally replace these with "_" as needed.
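The naming rules above can be sketched as a small Python checker. This is an illustration only, not how Rave validates names; the function names and the `comma_delimited` and `parsed_functions` flags are hypothetical, and restricting "letters" to ASCII is an assumption.

```python
import re

# Assumption: "letters" means ASCII letters; the source does not say.
_PARSED_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_ ]*")


def clean_variable_name(name, comma_delimited=False, parsed_functions=False):
    """Return the name with leading/trailing spaces removed, or raise ValueError."""
    name = name.strip(" ")  # only spaces are trimmed, so tabs are still detected
    if "\t" in name:
        raise ValueError("variable names cannot contain a tab character")
    if '"' in name:
        raise ValueError("variable names cannot contain double quotes")
    if comma_delimited and "," in name:
        raise ValueError("commas are not allowed in comma-delimited files")
    if parsed_functions and not _PARSED_NAME.fullmatch(name):
        raise ValueError(
            "name must begin with a letter and contain only "
            "letters, numbers, spaces, and _"
        )
    return name


def internal_name(name):
    """Mimic the described internal replacement of spaces with underscores."""
    return name.replace(" ", "_")
```

For example, `clean_variable_name("  Wing Area ")` gives `"Wing Area"`, and `internal_name("Wing Area")` gives `"Wing_Area"`.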
See also: Renaming Variables
Variable Types
Each variable may be an independent variable or a dependent variable.
Class
Each variable has one of the following classes that describes the information it encodes. When you load a new data set, Rave will attempt to determine the class of each variable. Typically numerical variables will begin as "continuous" and string variables will begin as "string". You can change these classes from the Manage Data Sets window.
Continuous numerical variables can take any value between their min and max values.
Integer numerical variables can take only integer values between their min and max values.
Discrete numerical variables can only take values from a user-defined list of allowable values.
Logical variables can only take 0 or 1 values. You can customize how these values appear on graphs, for example "True" vs "False" or "Yes" vs "No".
Text variables are like Discrete variables, but instead of taking numerical values they take any text string. Operations that perform math cannot be used with Text variables.
Note: Currently all dependent variables must be continuous. This restriction matters little in practice, because the variable class is only used for independent variables. (If the output of a function is discrete, it doesn't matter that Rave treats it as continuous.)
See also: Changing variable classes
Allowable Values (Independent Variables Only)
Depending on the class of variable, each independent variable has a limited set of values it is allowed to take. These are enforced whenever Rave lets the user choose a value, for example, by using a slider control.
Note that these limits may not be enforced when working with optimizers from the Optimization Toolbox or the Global Optimization Toolbox.
Continuous variables have defined minimum and maximum values. The variable can take any value within this range. When you load a new data set, these values are set to be the minimum and maximum values found in the data file.
Integer variables have defined minimum and maximum values. The variable can take any integer value within this range. When you load a new data set, these values are set to be the minimum and maximum values found in the data file.
Discrete variables have a list of allowable values. They can only take values from this list.
Logical variables can be true or false. You can optionally force them to always be true or always be false.
Text variables have a list of allowable strings. They can only take values from this list.
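As a rough illustration of the per-class rules above, the following Python sketch checks whether a value is allowable and derives initial limits from loaded data. It is not Rave's code; the function names, the class labels passed as strings, and the choice to skip NaN when computing limits are all assumptions.

```python
import math


def initial_range(data):
    """Min and max found in the loaded data (assumption: NaN entries are skipped)."""
    present = [v for v in data if not math.isnan(v)]
    return min(present), max(present)


def is_allowed(value, var_class, minimum=None, maximum=None, allowed=None):
    """Return True if `value` is allowable for a variable of class `var_class`."""
    if var_class == "continuous":
        return minimum <= value <= maximum
    if var_class == "integer":
        return float(value).is_integer() and minimum <= value <= maximum
    if var_class in ("discrete", "text"):
        # Both classes pick from a user-defined list of values.
        return value in allowed
    if var_class == "logical":
        # `allowed` may narrow the choice to always-true (1,) or always-false (0,).
        return value in (allowed if allowed is not None else (0, 1))
    raise ValueError(f"unknown variable class: {var_class}")
```

A slider-style control could call `is_allowed` before accepting a new value, which corresponds to the statement that limits are enforced whenever the user chooses a value.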
See also: Changing variables allowable values
Data and Current Values
Modifying Data Sets
There are several actions you can take in Rave to modify a data set. It will always be apparent when you are performing such an action. Working with graphs, optimizers, or surrogate models will never alter your data set without asking you first.
When you modify a data set, all graphs that use that data set will be automatically updated to reflect the new data.
Certain modifications will reset row colors, selection state, or visibility state. For example, if you replace the entire data set, all rows will reset to their default state (visible, unselected, default color).
Some of the most common ways to modify a data set are:
- Add new columns by loading a function
- Add new rows by appending the results of an optimizer
- Change individual data values by editing the main table
- Delete rows by right-clicking the main table header
- Add new rows by appending a design of experiments, or replace all rows in the data set
Working With Missing Data Values
Rave allows your data set to contain "missing values." The following rules generally apply:
- Currently, Rave only considers data to be missing for numerical variables. A missing value in a string variable is simply treated as a blank string.
- Missing data appears in the main data table as "NaN" colored light gray.
- Rave will not let you create a data set that contains a variable (column) in which every value is missing.
- Avoid modifying a variable within Rave so that all of its data values become NaN; the results may be unpredictable.
- When you create a data set, any "cell" in an otherwise numerical column that is a string or is empty will be converted to a "NaN" missing value.
- You can replace missing numerical values in the main table by selecting the NaN cell and typing a number.
- To replace a number with NaN, type "NaN" in the cell exactly as shown (the entry is case sensitive).
- Each visualization displays missing data values differently, but in general a visualization will only display rows of your data set in which EVERY variable shown in that visualization has no missing values.
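Two of the rules above, converting non-numeric cells to NaN and showing only rows that are complete for the displayed variables, can be sketched in Python as follows. This is an illustration under assumptions, not Rave's implementation: the function names are hypothetical, and raw cells are assumed to arrive as text read from a file.

```python
import math


def to_numeric_cell(cell):
    """Convert one raw cell of a numerical column; unparseable or empty cells become NaN."""
    try:
        return float(cell)
    except (TypeError, ValueError):
        return math.nan


def displayable_rows(columns, displayed):
    """Indices of rows in which every displayed variable has a non-missing value."""
    n_rows = len(columns[displayed[0]])
    return [
        i
        for i in range(n_rows)
        if not any(math.isnan(columns[name][i]) for name in displayed)
    ]
```

With columns `x = ["1", "", "3"]` and `y = ["4", "5", "oops"]` converted cell by cell, a plot of `x` against `y` would keep only the first row, since each of the other rows has a missing value in one of the two variables.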