Difference between revisions of "Working with Random Variables"
Revision as of 09:23, 17 October 2013
Introduction
Random variables are useful for modeling uncertainty in a system. Unlike regular variables, which are defined by (deterministic) values, random variables are defined by distributions. Rave supports the use of random variables through various sampling-based (Monte Carlo) approaches. The main ways to use random variables in Rave are:
- Declare a variable to be random, in which case it is treated as being random for all purposes in Rave.
- Sample data according to random distributions. This lets you generate a data set that involves randomness, but Rave otherwise treats this data as if it were a deterministic sampling.
In either case, Rave samples the random variables according to distributions that you define in Rave. The process of defining distributions is described below.
Note: only independent variables can be treated as random. Function variables become "statistic variables" when one or more of their input variables are random. See below for details.
Defining and Modifying Distributions
Working with Random Variables
Method 1: Declaring a Variable to be Random
To declare a variable to be random, right click its column header in the main table and select Treat as a Random Variable.
To indicate that a particular variable is random, its name in the main table column header turns purple, and its values in the table are replaced with the word "Random".
The reasoning behind this is that a random variable has no single value for each row in your data set; instead, it has a distribution of values. Any variables that are functions of one or more random variables are also defined by distributions. In order to force these function variables to still have a single value, Rave automatically converts any variables that are functions of random variables into statistic variables. You will notice that the column headers of any such functions are changed to reflect this.
For example, suppose you have three variables, x, y, and z, such that z=f(x,y). If you change y into a random variable, you will see that the column header for z changes to something like "mean(z)" to indicate that the values displayed in the table are no longer the values of z itself; instead, they are the mean value of z calculated over the distribution of y.
To calculate the statistics, Rave uses a sampling-based approach. Suppose again that z=f(x,y), where y is random and x is deterministic. If Rave needs to calculate z for x=5 and y=N(0,1) (i.e. y is normally distributed with mean 0 and standard deviation 1), Rave samples many random values of y such that the sampled values are approximately distributed as N(0,1). Supposing 5000 such samples are used, Rave then evaluates z 5000 times, each time using x=5 and one of the 5000 sampled values of y. This yields 5000 values of z, which Rave aggregates back into a single value using the specified statistic; for example, mean(z) returns the average of these 5000 values.
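The sampling procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Rave's actual implementation: the function f, the helper statistic_of_z, its parameter names, and the fixed seed are all hypothetical, and plain pseudo-random sampling stands in for whatever sampling method is selected in the preferences.

```python
import random
import statistics

def f(x, y):
    # Hypothetical deterministic function standing in for z = f(x, y).
    return x + 2.0 * y

def statistic_of_z(x, n_samples=5000, stat=statistics.mean, seed=0):
    """Estimate a scalar statistic of z = f(x, y) with y ~ N(0, 1)."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    # Draw n_samples values of y from N(0, 1).
    y_samples = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    # Evaluate z once per sampled y, always with the same deterministic x.
    z_samples = [f(x, y) for y in y_samples]
    # Aggregate the sampled values of z into a single number.
    return stat(z_samples)

mean_z = statistic_of_z(5.0)                             # "mean(z)" at x = 5
median_z = statistic_of_z(5.0, stat=statistics.median)   # "median(z)" at x = 5
```

Swapping the stat argument changes only the aggregation step; the 5000 evaluations of f are the same in both calls.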
There are four parameters that you can change to customize Rave's approach to calculating functions of random variables:
- The distribution of each random variable
- The number of samples used to calculate the statistic (5000 in the example above)
- The method by which the random values are sampled from the distributions
- The statistic used to aggregate the values of z back into a single value
Very important: In the example above, it took 5000 evaluations of the function f to calculate the mean of z for the single case x=5. In reality, most activities in Rave require many such cases. For example, creating a contour plot that uses a 10x10 grid of points would require 100 deterministic cases, so with 5000 random samples per case it would require 500,000 function evaluations! Working with random variables is very computationally intensive, and consequently it is only feasible when your functions are extremely fast (think surrogate models or other algebraic functions).
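The cost estimate above is simple multiplication, and it can help to compute it before starting an expensive activity. The helper below is a hypothetical sketch (total_evaluations is not a Rave function); it assumes every deterministic case uses the same number of random samples.

```python
def total_evaluations(grid_shape, n_samples):
    """Number of function evaluations for a full grid of deterministic cases."""
    n_cases = 1
    for points_along_axis in grid_shape:
        n_cases *= points_along_axis  # e.g. 10 * 10 = 100 cases
    # Each deterministic case is evaluated once per random sample.
    return n_cases * n_samples

contour_cost = total_evaluations((10, 10), 5000)
```

For the 10x10 contour plot with 5000 samples, this gives the 500,000 evaluations quoted above.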
Method 2: Sampling Data from Distributions
Statistic Variables: Functions of Random Variables
The way Rave handles functions of random variables is a little bit complicated. In "real life," a function of a random variable is also a random variable. However, there is no easy way for Rave to store and represent the full distribution of a random variable. Instead, when you have functions of random variables, Rave represents their values to you as scalar-valued statistics. These statistics are calculated by sampling from the distribution many times (the number of samples is determined by the "Random Variable Sampling Size" preference) and then aggregating the resulting sampled values into a single number, which is the statistic.
Rave treats functions of random variables as statistics whenever it needs to calculate a single value of the function variable for each row of your data table. The most obvious example is that when you are viewing the table, you will see the statistic values listed. Similarly, when you are viewing most visualizations you will see the statistic values used. But there are some important exceptions:
- Rave always calculates the statistics by some sort of Monte Carlo sampling (you can change how this sampling works by modifying the preferences). Consequently, Rave has access to the full sampled population and is capable of displaying and storing it, although in general it does not do so. The important point is that you can code special workspace objects, explore methods, or other plugins that use the entire distribution.
- When you have a function of functions of random variables, for example f(g(x)), where x includes some random variables, Rave will generally represent both f and g to you as statistics. However, (this is important) Rave does not calculate f using the statistic value of g(x). Instead, f is also calculated as a distribution, using the actual sampled values of g(x). In other words, if we let s represent some statistic function that takes in a distribution and returns a scalar value, the statistic value of f(g(x)) is calculated as s(f(g(x))), NOT as s(f(s(g(x)))). Put another way, if you change the statistic used to represent g(x), for example from "mean" to "median", the values of f(g(x)) will not change, because they are based on the actual sampled values of g(x), not on the values of the statistic that Rave uses to represent g(x) to you.
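The difference between s(f(g(x))) and s(f(s(g(x)))) can be seen in a small Python sketch. The functions g and f, the choice of x ~ N(0, 1), the sample count, and the seed are all hypothetical choices made for illustration; the code mimics the behavior described above rather than reproducing Rave's internals.

```python
import random
import statistics

def g(x):
    return x * x   # hypothetical inner function

def f(u):
    return u * u   # hypothetical nonlinear outer function

rng = random.Random(0)                       # fixed seed, an assumption
x_samples = [rng.gauss(0.0, 1.0) for _ in range(5000)]
g_samples = [g(x) for x in x_samples]        # sampled distribution of g(x)

# What the text describes: propagate every sampled value of g(x) through f,
# then aggregate once at the end, i.e. s(f(g(x))).
propagated = statistics.mean(f(u) for u in g_samples)

# What is NOT done: collapse g(x) to its displayed statistic first,
# i.e. s(f(s(g(x)))). The result depends on which statistic represents g.
collapsed_mean = f(statistics.mean(g_samples))
collapsed_median = f(statistics.median(g_samples))
```

Because propagated is computed from g_samples directly, it is unaffected by whether g(x) is shown as a mean or a median, whereas the two collapsed values differ from it and from each other.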
Rave presumes that all functions you load are deterministic. In other words, for each unique input vector x, the function always returns the same value of y. For example, your functions never call rand() or otherwise introduce randomness internally; randomness comes only from randomly varying the inputs to the function.