Databases, symbols, and tick types#
The data is organized by database (e.g., NYSE_TAQ
or CME
), symbol (e.g., AAPL
), and tick type (e.g., TRD
or QTE
). A data source must specify these (possibly providing multiple values for each) to retrieve data.
A tick type can be set explicitly via the tick_type
parameter of a data source.
The database name can be specified via the db
parameter or as part of a symbol using the ::
separator.
For example
>>> trades = otp.DataSource(db='NYSE_TAQ', symbol='AAPL', tick_type='TRD', date=otp.dt(2022, 3, 1))
>>> volume = trades.agg({'VOLUME': otp.agg.sum(trades['SIZE'])})
>>> volume()
Time VOLUME
0 2022-03-02 90604833
is equivalent to
>>> trades = otp.DataSource(symbol='NYSE_TAQ::AAPL', tick_type='TRD', date=otp.dt(2022, 3, 1))
>>> trades.count()
749660
The same technique applies to the database and tick type:
>>> trades = otp.DataSource(symbol='AAPL', tick_type='NYSE_TAQ::TRD', date=otp.dt(2022, 3, 1))
>>> trades.count()
749660
Symbols: bound and unbound#
There are two ways of setting symbols: bound and unbound.
Bound symbols are specified when defining a source or with merge
.
Unbound symbols are set when the query is executed and apply to all sources that do not specify bound symbols.
Here is an simple example of an unbound symbol:
>>> trades = otp.DataSource(db='NYSE_TAQ', tick_type='TRD', date=otp.dt(2022, 3, 1))
>>> volume = trades.agg({'VOLUME': otp.agg.sum(trades['SIZE'])})
>>> volume(symbols=['AAPL']) # unbound symbol is set here
Time VOLUME
0 2022-03-02 90604833
There is no difference between bound and unbound when we talk about a single symbol. To appreciate the difference, let’s look at two symbols.
>>> trades = otp.DataSource(db='NYSE_TAQ', tick_type='TRD', date=otp.dt(2022, 3, 1))
>>> volume = trades.agg({'VOLUME': otp.agg.sum(trades['SIZE'])})
>>> dict(sorted(volume(symbols=['AAPL', 'MSFT']).items()))
{'AAPL': Time VOLUME
0 2022-03-02 90604833,
'MSFT': Time VOLUME
0 2022-03-02 35035286}
The results for each unbound symbol are processed separately and returned in a separate pandas DataFrame. The result of the run method above is a dict with symbols as keys and pandas DataFrames as values.
Contrast this with bound symbols where the ticks are merged into a single flow:
>>> trades = otp.DataSource(db='NYSE_TAQ', tick_type='TRD', date=otp.dt(2022, 3, 1), symbols=['AAPL', 'MSFT'])
>>> volume = trades.agg({'VOLUME': otp.agg.sum(trades['SIZE'])})
>>> volume()
Time VOLUME
0 2022-03-02 125640119
Specifying bound symbols on a source is just a shorthand for the merge
method added right
after the source definition.
>>> trades = otp.DataSource(db='NYSE_TAQ', tick_type='TRD', date=otp.dt(2022, 3, 1))
>>> cross_symbol_trades = otp.merge([trades], symbols=['AAPL', 'MSFT'])
>>> volume = cross_symbol_trades.agg({'VOLUME': otp.agg.sum('SIZE')})
>>> volume()
Time VOLUME
0 2022-03-02 125640119
Concurrency#
Unbound symbols can be processed in parallel by specifying the concurrency
parameter of otp.run
.
You have to explicitly use otp.run
to execute the query if you want to specify concurrency
.
otp.run(volume, symbols=['AAPL', 'MSFT', 'AMZN', 'META'], concurrency=8)
For bound symbols, all calculations passed into merge
(trades
in example above) run in parallel.
Calculations can be made faster by computing as much as possible per symbol before merging them.
The following example has the same result as the previous one but it finds the total volume faster as it calculates
the volume for every symbol independently which can be done in parallel and then adds up the total.
>>> trades = otp.DataSource(db='NYSE_TAQ', tick_type='TRD', date=otp.dt(2022, 3, 1))
>>> volume = trades.agg({'VOLUME': otp.agg.sum('SIZE')})
>>> cross_symbol_volumes = otp.merge([volume], symbols=['AAPL', 'MSFT'], presort=True, identify_input_ts=True)
>>> otp.run(cross_symbol_volumes, concurrency=8)
Time VOLUME SYMBOL_NAME TICK_TYPE
0 2022-03-02 90604833 AAPL TRD
1 2022-03-02 35035286 MSFT TRD
>>> total_volume = cross_symbol_volumes.agg({'TOTAL_VOLUME': otp.agg.sum('VOLUME')})
>>> otp.run(total_volume, concurrency=8)
Time TOTAL_VOLUME
0 2022-03-02 125640119
Note
All sources passed into merge
are considered bound symbol sources.
Symbols#
Specifying symbols dynamically#
In many cases it is necessary to select symbols from a databases according to some logic.
The Symbols
source makes this easy.
>>> symbols = otp.Symbols(db='NYSE_TAQ', date=otp.dt(2022,8,17), pattern='AAA%')
>>> symbols()
Time SYMBOL_NAME
0 2022-08-17 AAA
1 2022-08-17 AAAU
The result of Symbols
can be used as bound/unbound symbols.
>>> data = otp.DataSource(db='NYSE_TAQ', tick_type='TRD', start=otp.dt(2022, 3, 1), end=otp.dt(2022, 3, 2))
>>> data = data.first(3)
>>> data = data[['PRICE','SIZE']]
>>> dict(sorted(data(symbols=otp.Symbols(db='NYSE_TAQ', pattern='AAA%')).items()))
{'AAA': Time PRICE SIZE
0 2022-03-01 09:30:00.176313344 24.9700 2
1 2022-03-01 09:30:00.176313344 24.9700 2
2 2022-03-01 09:53:48.941710848 24.9427 3,
'AAAU': Time PRICE SIZE
0 2022-03-01 04:45:01.735280640 19.06 40
1 2022-03-01 04:59:51.663249920 19.05 1
2 2022-03-01 05:05:23.475195136 19.05 1049}
Note that the interval of the main query is implicitly used in otp.Symbols(db='NYSE_TAQ', pattern='AAA%')
.
The symbols
parameter can be any type of calculation that contains the SYMBOL_NAME
field in the resulting ticks.
Every tick provides a separate symbol name in the SYMBOL_NAME
field while the other fields are passed as
‘symbol parameters’.
Symbol parameters#
Symbol parameters are any fields other than SYMBOL_NAME
that are constant for an instrument. We illustrate this below.
First, consider a query that finds 2 most traded symbols among symbols starting with the letter ‘A’. We’ll later use the
output of this query to specify symbols and symbol params.
>>> trd = otp.DataSource(db='NYSE_TAQ', tick_type='TRD', start=otp.dt(2022, 3, 1), end=otp.dt(2022, 3, 2))
>>> trd = trd.agg({'VOLUME': otp.agg.sum('SIZE')})
>>> count = otp.merge([trd], symbols=otp.Symbols(db='NYSE_TAQ', pattern='A%'), presort=True, identify_input_ts=True)
>>> most_traded = count.high('VOLUME', n=2)
>>> most_traded()
Time VOLUME SYMBOL_NAME TICK_TYPE
0 2022-03-02 209053345 ATXI TRD
1 2022-03-02 130093671 AMD TRD
The output of most_traded
provides symbols in the SYMBOL_NAME
column while the other columns (VOLUME
and TICK_TYPE
)
provide symbol parameters. The following code retrieves the value of VOLUME
for use in tick-by-tick analytics.
>>> trd = otp.DataSource('NYSE_TAQ', tick_type='TRD')
>>> trd = trd.agg({'VOLUME': otp.agg.sum('SIZE')})
>>> count = otp.merge([trd], symbols=otp.Symbols('NYSE_TAQ', pattern='A%'), identify_input_ts=True)
>>> most_traded = count.high('VOLUME', n=2)
>>>
>>> data = otp.DataSource('NYSE_TAQ', tick_type='TRD')
>>> data = data.first(3)
>>> data = data[['PRICE','SIZE']]
>>> data['VOLUME'] = data.Symbol.VOLUME
>>> dict(sorted(otp.run(data, symbols=most_traded, start=otp.dt(2022, 3, 1), end=otp.dt(2022, 3, 2)).items()))
{'AMD': Time PRICE SIZE VOLUME
0 2022-03-01 04:00:00.033400871 123.60 274 130093671
1 2022-03-01 04:00:00.057311347 123.35 3 130093671
2 2022-03-01 04:00:00.057318115 123.35 1 130093671,
'ATXI': Time PRICE SIZE VOLUME
0 2022-03-01 04:00:00.052257108 0.6302 100 209053345
1 2022-03-01 04:00:01.140209439 0.5508 1000 209053345
2 2022-03-01 04:00:02.144151171 0.6000 100 209053345}
Equivalently, symbol parameters can be accessed by wrapping a function around the query as illustrated below.
>>> trd = otp.DataSource('NYSE_TAQ', tick_type='TRD')
>>> trd = trd.agg({'VOLUME': otp.agg.sum('SIZE')})
>>> count = otp.merge([trd], symbols=otp.Symbols('NYSE_TAQ', pattern='A%'), identify_input_ts=True)
>>> most_traded = count.high('VOLUME', n=2)
>>>
>>> def query(sym):
... data = otp.DataSource('NYSE_TAQ', tick_type='TRD')
... data = data.first(3)
... data = data[['PRICE','SIZE']]
... data['VOLUME'] = sym['VOLUME']
... return data
>>>
>>> dict(sorted(otp.run(query, symbols=most_traded, start=otp.dt(2022, 3, 1), end=otp.dt(2022, 3, 2)).items()))
{'AMD': Time PRICE SIZE VOLUME
0 2022-03-01 04:00:00.033400871 123.60 274 130093671
1 2022-03-01 04:00:00.057311347 123.35 3 130093671
2 2022-03-01 04:00:00.057318115 123.35 1 130093671,
'ATXI': Time PRICE SIZE VOLUME
0 2022-03-01 04:00:00.052257108 0.6302 100 209053345
1 2022-03-01 04:00:01.140209439 0.5508 1000 209053345
2 2022-03-01 04:00:02.144151171 0.6000 100 209053345}
Time interval per symbol#
It is allowed to specify query interval per symbol using special fields _PARAM_START_TIME_NANOS
and _PARAM_START_TIME_NANOS
>>> custom_symbols = otp.Ticks(SYMBOL_NAME=['AAPL', 'MSFT'],
... _PARAM_START_TIME_NANOS=[otp.dt(2022, 3, 1, 10, 30), otp.dt(2022, 3, 1, 11)],
... _PARAM_END_TIME_NANOS=[otp.dt(2022, 3, 1, 11), otp.dt(2022, 3, 1, 11, 30)])
>>> data = otp.DataSource(db='NYSE_TAQ', tick_type='TRD')
>>> data = data.agg({'VOLUME': otp.agg.sum('SIZE')})
>>> data['SYMBOL'] = data.Symbol.name
>>> dict(sorted(otp.run(data, start=otp.dt(2022, 3, 1), end=otp.dt(2022, 3, 2), symbols=custom_symbols).items()))
{'AAPL': Time VOLUME SYMBOL
0 2022-03-01 11:00:00 6014546 AAPL,
'MSFT': Time VOLUME SYMBOL
0 2022-03-01 11:30:00 1503972 MSFT}
Note that per symbol intervals should be inside the query interval.
Associated symbols#
Associated symbols is a technique when unbound symbols are used to define bound symbols or symbols in related queries.
It is expressed using the eval
and а symbol parameter. More details can be
found in the API doc for the eval