Split Training and Testing Sets

Question: When training machine learning models, typically at least two sets of data are used: the training set and the testing set. These sets are distinct subsets from a pool of data. Define a function 'stt' that takes in a list or table and a test set ratio, splits the data into two lists/tables based on the test set ratio, and returns a two item list with the first item representing the training set and the second item representing the test set.

Example

                                
                                q)show t:([]s:100?`AAPL`MSFT`IBM`TSLA;p:100?400)
s    p
--------
AAPL 372
AAPL 326
TSLA 329
IBM  330
TSLA 50
..
q)s:stt[t;.2]
q)s
+`s`p!(`AAPL`AAPL`TSLA`IBM`IBM`IBM`AAPL`TSLA`TSLA`IBM`MSFT`TSLA`AAPL`AAPL`IBM`IBM`TSLA`IBM`MSFT`TSLA`MSFT`TSLA`AAPL`IBM`AAPL`MSFT`AAPL`IBM`AAPL`AAPL`IBM`MSFT`TSLA`TSLA`AAPL`AAPL`TSLA`MSFT`MSFT`TSLA..
+`s`p!(`AAPL`TSLA`MSFT`IBM`MSFT`TSLA`MSFT`IBM`MSFT`MSFT`MSFT`TSLA`IBM`MSFT`TSLA`TSLA`IBM`TSLA`MSFT`AAPL;213 50 58 97 232 152 67 351 25 323 13 263 224 29 2 383 114 187 137 393)
q)s[0]
s    p
--------
AAPL 372
AAPL 326
TSLA 329
IBM  330
IBM  251
..
q)count s[0]
80
q)count s[1]
20
q)stt[`a`b`c`d`e`f`g;.2]
`a`b`c`d`f`g
,`e
                                
                            

Solution

Tags:
functions machine learning
Searchable Tags
algorithms api architecture csv data structures dictionaries disk feedhandler finance functions ingestion ipc iterators machine learning math multithreading optimizations realtime sql statistics streaming strings tables temporal websockets

Email sent!

Email not sent