Stratified Sampling

Question: Stratified sampling is a sampling method that divides a population into smaller sub-groups known as strata. The size of each stratum in the sample should be proportional to the size of the same stratum in the population. Define a function 'sfsmpl' that takes a list or table, the name of the column to base the strata on (for table data), an interval for binning numeric strata values, and a sample size, and returns a sample of the data whose strata sizes are proportional to the strata sizes of the population. The strata values can be numeric or symbols.

More Information:

https://en.wikipedia.org/wiki/Stratified_sampling

Example

q)d1:([]gender:1000000?`M`F;age:1+1000000?65;score:(1500+200000?901),800000?1500) 
 
// table, strata by gender values 
q)select count each age from 0!`gender xgroup sfsmpl[d1;`gender;`;1000] 
age 
--- 
500 
500 
 
// table, strata by score values, binned in increments of 500 
q)sfsmpl[d1;`score;500;1000] 
gender age score 
---------------- 
M      44  1623 
F      38  1703 
F      23  1578 
M      13  1561 
F      52  1910 
F      5   1814 
F      33  1645 
F      25  1719 
M      19  1501 
F      29  1853 
M      16  1958 
F      51  1698 
F      21  1895 
M      53  1525 
M      8   1761 
.. 
 
q)select count score by 500 xbar score from sfsmpl[d1;`score;500;1000] 
score| score 
-----| ----- 
0    | 267 
500  | 266 
1000 | 267 
1500 | 111 
2000 | 89 
 
// list, 500 increment bins 
q)count each group 500 xbar sfsmpl[d1`score;`;500;1000] 
1500| 111 
2000| 89 
500 | 266 
1000| 267 
0   | 267

Solution
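
The original solution is not shown here. What follows is only one possible sketch, written under a few assumptions: the input is a list or an unkeyed table, the null symbol ` is passed for an unused column name or bin width (as in the examples above), and the local names v, k, g, m and i are illustrative. Each stratum gets round(n * stratum size / population size) items, drawn without replacement.

```q
/ Hypothetical sketch, not the original solution.
/ d: table or list; c: strata column (` for a list)
/ w: bin width (` for no binning); n: sample size
sfsmpl:{[d;c;w;n]
  v:$[98h=type d;d c;d];                 / values that define the strata
  k:$[null w;v;w xbar v];                / stratum key for each item
  g:group k;                             / stratum key -> item indices
  m:floor 0.5+n*(count each g)%count k;  / proportional sizes, rounded to nearest
  i:raze {neg[x]?y}'[value m;value g];   / draw without replacement within each stratum
  d i}
```

Because each stratum size is rounded independently, the total returned can differ from the requested sample size by a few items. A fuller version would adjust the largest strata so that the sizes sum exactly to n.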

Tags:
functions machine learning statistics