Working with Random Variables
Introduction
Random variables are useful for modeling uncertainty in a system. Unlike regular variables, which are defined by (deterministic) values, random variables are defined by distributions. Rave supports the use of random variables through various sampling-based (Monte Carlo) approaches. The main ways to use random variables in Rave are:
- Declare a variable to be random, in which case it is treated as being random for all purposes in Rave.
- Sample data according to random distributions. This lets you generate a data set that involves randomness, but Rave otherwise treats this data as if it were deterministic.
In either case, Rave samples the random variables according to distributions that you define in Rave. The process of defining distributions is described below.
Note: only independent variables can be treated as random. Function variables become random when one or more of their input variables are random.
Defining and Modifying Distributions
Working with Random Variables
Method 1: Declaring a Variable to be Random
To declare a variable to be random, right click its column header in the main table and select Treat as a Random Variable.
To indicate that a particular variable is random, its name in the main table column header turns purple, and its values in the table are replaced with the word "Random".
The reasoning behind this is that a random variable has no single value for each row in your data set; instead, it has a distribution of values. Any variables that are functions of one or more random variables are also defined by distributions. To force these function variables to still have a single value, Rave automatically converts any variables that are functions of random variables into statistic variables. You will notice that the column headers of any such functions are changed to reflect this.
For example, suppose you have three variables, x, y, and z, such that z=f(x,y). If you change y into a random variable, you will see that the column header for z changes to something like "mean(z)" to indicate that the values displayed in the table are no longer the values of z itself; instead, they are the mean value of z calculated over the distribution of y.
To calculate the statistics, Rave uses a sampling-based approach. Suppose again that z=f(x,y), where y is random and x is deterministic. If Rave needs to calculate z for x=5 and y distributed as N(0,1) (i.e., normally distributed with mean 0 and standard deviation 1), it draws many random values of y so that the sampled values are approximately distributed as N(0,1). If 5000 samples are used, Rave evaluates z 5000 times, each time with x=5 and y set to one of the 5000 sampled values in turn. This yields 5000 values of z, which Rave aggregates into a single value using the specified statistic; for example, mean(z) returns the average of these 5000 values.
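The sampling procedure described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Rave's implementation: the function f, the helper name estimate_statistic, and the fixed seed are all assumptions made for the example.

```python
import random
import statistics

def f(x, y):
    # Hypothetical stand-in for the user's function z = f(x, y).
    return x + y ** 2

def estimate_statistic(x, n_samples=5000, statistic=statistics.mean, seed=0):
    """Estimate a statistic of z = f(x, y) with x fixed and y ~ N(0, 1)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    # Draw n_samples values of y from N(0, 1).
    y_samples = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    # Evaluate z once per sampled y, always with the same deterministic x.
    z_samples = [f(x, y) for y in y_samples]
    # Aggregate the sampled z values back into a single number.
    return statistic(z_samples)

mean_z = estimate_statistic(5.0)
print(mean_z)
```

For this particular f, the exact mean of z at x=5 is 6, so the printed estimate should land close to that value, with the gap shrinking as the number of samples grows.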
There are four parameters that you can change to customize Rave's approach to calculating functions of random variables:
- The distribution of each random variable
- The number of samples used to calculate the statistic (5000 in the example above)
- The method by which the random values are sampled from the distributions
- The statistic used to aggregate the values of z back into a single value.
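To make the role of these four parameters concrete, the sketch below exposes each one as an argument. It is a hedged illustration only: the names evaluate_statistic and sample_uniforms, the example function f, and the two sampling methods shown ("random" and "stratified") are assumptions chosen for the example and are not taken from Rave.

```python
import random
import statistics
from statistics import NormalDist

def f(x, y):
    # Hypothetical stand-in for the user's function z = f(x, y).
    return x + y ** 2

def sample_uniforms(n, method, rng):
    """Return n values in (0, 1) using the chosen sampling method."""
    if method == "random":
        us = [rng.random() for _ in range(n)]
    elif method == "stratified":
        # One draw from each of n equal-width strata of (0, 1).
        us = [(i + rng.random()) / n for i in range(n)]
    else:
        raise ValueError(f"unknown sampling method: {method}")
    # Keep values strictly inside (0, 1) so the inverse CDF is defined.
    return [min(max(u, 1e-12), 1.0 - 1e-12) for u in us]

def evaluate_statistic(x, dist, n_samples, method, statistic, seed=0):
    rng = random.Random(seed)
    us = sample_uniforms(n_samples, method, rng)
    ys = [dist.inv_cdf(u) for u in us]        # 1. the distribution of y
    zs = [f(x, y) for y in ys]                # 2. n_samples evaluations of f
    return statistic(zs)                      # 4. the aggregating statistic

y_dist = NormalDist(mu=0.0, sigma=1.0)
mean_random = evaluate_statistic(5.0, y_dist, 5000, "random", statistics.mean)
mean_strat = evaluate_statistic(5.0, y_dist, 5000, "stratified", statistics.mean)
std_strat = evaluate_statistic(5.0, y_dist, 5000, "stratified", statistics.pstdev)
print(mean_random, mean_strat, std_strat)
```

Changing any one argument (the distribution, the sample count, the sampling method, or the statistic) changes the reported value of z without touching f itself.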
Very important: In the example above, it took 5000 evaluations of the function f to calculate the mean of z for the single case x=5. In practice, most activities in Rave require many such cases. For example, a contour plot that uses a 10x10 grid of points requires 100 deterministic cases, so with 5000 random samples per case it requires 500,000 function evaluations! Working with random variables is very computationally intensive, and consequently it is only feasible when your functions are extremely fast (think surrogate models or other algebraic functions).
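The cost arithmetic above is easy to reproduce for your own case before you commit to a run. The helper below is a hypothetical sketch, and the per-evaluation time of 1 millisecond is an assumed figure used only to show how the total scales.

```python
def total_evaluations(grid_shape, n_samples):
    """Number of function evaluations for a grid of deterministic cases."""
    n_cases = 1
    for points_along_axis in grid_shape:
        n_cases *= points_along_axis
    # Every deterministic case needs n_samples evaluations of f.
    return n_cases * n_samples

total = total_evaluations((10, 10), 5000)
print(total)  # 500000

# Assumed cost per evaluation (hypothetical): 1 millisecond.
seconds_per_evaluation = 0.001
estimated_seconds = total * seconds_per_evaluation
print(estimated_seconds)
```

Under that assumed timing, the 10x10 contour plot takes roughly 500 seconds, which is why functions that take seconds or minutes per evaluation are not practical here.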