Databases, symbols, and tick types#

The data is organized by database (e.g., NYSE_TAQ or CME), symbol (e.g., AAPL), and tick type (e.g., TRD or QTE). A data source must specify these (possibly providing multiple values for each) to retrieve data.

A tick type can be set explicitly via the tick_type parameter of a data source.

The database name can be specified via the db parameter or as part of a symbol using the :: separator. For example

>>> trades = otp.DataSource(db='NYSE_TAQ', symbol='AAPL', tick_type='TRD', date=otp.dt(2022, 3, 1))
>>> volume = trades.agg({'VOLUME': otp.agg.sum(trades['SIZE'])})
>>> volume()
         Time    VOLUME
 0 2022-03-02  90604833

is equivalent to

>>> trades = otp.DataSource(symbol='NYSE_TAQ::AAPL', tick_type='TRD', date=otp.dt(2022, 3, 1))
>>> trades.count()
749660

The same technique applies to the database and tick type:

>>> trades = otp.DataSource(symbol='AAPL', tick_type='NYSE_TAQ::TRD', date=otp.dt(2022, 3, 1))
>>> trades.count()
749660

Symbols: bound and unbound#

There are two ways of setting symbols: bound and unbound.

Bound symbols are specified when defining a source or with merge. Unbound symbols are set when the query is executed and apply to all sources that do not specify bound symbols. Here is an simple example of an unbound symbol:

>>> trades = otp.DataSource(db='NYSE_TAQ', tick_type='TRD', date=otp.dt(2022, 3, 1))
>>> volume = trades.agg({'VOLUME': otp.agg.sum(trades['SIZE'])})
>>> volume(symbols=['AAPL'])  # unbound symbol is set here
         Time    VOLUME
 0 2022-03-02  90604833

There is no difference between bound and unbound when we talk about a single symbol. To appreciate the difference, let’s look at two symbols.

>>> trades = otp.DataSource(db='NYSE_TAQ', tick_type='TRD', date=otp.dt(2022, 3, 1))
>>> volume = trades.agg({'VOLUME': otp.agg.sum(trades['SIZE'])})
>>> dict(sorted(volume(symbols=['AAPL', 'MSFT']).items()))
 {'AAPL':         Time    VOLUME
  0 2022-03-02  90604833,
  'MSFT':         Time    VOLUME
  0 2022-03-02  35035286}

The results for each unbound symbol are processed separately and returned in a separate pandas DataFrame. The result of the run method above is a dict with symbols as keys and pandas DataFrames as values.

Contrast this with bound symbols where the ticks are merged into a single flow:

>>> trades = otp.DataSource(db='NYSE_TAQ', tick_type='TRD', date=otp.dt(2022, 3, 1), symbols=['AAPL', 'MSFT'])
>>> volume = trades.agg({'VOLUME': otp.agg.sum(trades['SIZE'])})
>>> volume()
         Time     VOLUME
 0 2022-03-02  125640119

Specifying bound symbols on a source is just a shorthand for the merge method added right after the source definition.

>>> trades = otp.DataSource(db='NYSE_TAQ', tick_type='TRD', date=otp.dt(2022, 3, 1))
>>> cross_symbol_trades = otp.merge([trades], symbols=['AAPL', 'MSFT'])
>>> volume = cross_symbol_trades.agg({'VOLUME': otp.agg.sum('SIZE')})
>>> volume()
        Time     VOLUME
0 2022-03-02  125640119

Concurrency#

Unbound symbols can be processed in parallel by specifying the concurrency parameter of otp.run. You have to explicitly use otp.run to execute the query if you want to specify concurrency.

otp.run(volume, symbols=['AAPL', 'MSFT', 'AMZN', 'META'], concurrency=8)

For bound symbols, all calculations passed into merge (trades in example above) run in parallel. Calculations can be made faster by computing as much as possible per symbol before merging them. The following example has the same result as the previous one but it finds the total volume faster as it calculates the volume for every symbol independently which can be done in parallel and then adds up the total.

>>> trades = otp.DataSource(db='NYSE_TAQ', tick_type='TRD', date=otp.dt(2022, 3, 1))
>>> volume = trades.agg({'VOLUME': otp.agg.sum('SIZE')})
>>> cross_symbol_volumes = otp.merge([volume], symbols=['AAPL', 'MSFT'], presort=True, identify_input_ts=True)
>>> otp.run(cross_symbol_volumes, concurrency=8)
        Time    VOLUME SYMBOL_NAME TICK_TYPE
0 2022-03-02  90604833        AAPL       TRD
1 2022-03-02  35035286        MSFT       TRD

>>> total_volume = cross_symbol_volumes.agg({'TOTAL_VOLUME': otp.agg.sum('VOLUME')})
>>> otp.run(total_volume, concurrency=8)
        Time  TOTAL_VOLUME
0 2022-03-02     125640119

Note

All sources passed into merge are considered bound symbol sources.

Symbols#

Specifying symbols dynamically#

In many cases it is necessary to select symbols from a databases according to some logic. The Symbols source makes this easy.

>>> symbols = otp.Symbols(db='NYSE_TAQ', date=otp.dt(2022,8,17), pattern='AAA%')
>>> symbols()
        Time SYMBOL_NAME
0 2022-08-17         AAA
1 2022-08-17        AAAU

The result of Symbols can be used as bound/unbound symbols.

>>> data = otp.DataSource(db='NYSE_TAQ', tick_type='TRD', start=otp.dt(2022, 3, 1), end=otp.dt(2022, 3, 2))
>>> data = data.first(3)
>>> data = data[['PRICE','SIZE']]
>>> dict(sorted(data(symbols=otp.Symbols(db='NYSE_TAQ', pattern='AAA%')).items()))
{'AAA':                            Time    PRICE  SIZE
 0 2022-03-01 09:30:00.176313344  24.9700     2
 1 2022-03-01 09:30:00.176313344  24.9700     2
 2 2022-03-01 09:53:48.941710848  24.9427     3,
 'AAAU':                            Time  PRICE  SIZE
 0 2022-03-01 04:45:01.735280640  19.06    40
 1 2022-03-01 04:59:51.663249920  19.05     1
 2 2022-03-01 05:05:23.475195136  19.05  1049}

Note that the interval of the main query is implicitly used in otp.Symbols(db='NYSE_TAQ', pattern='AAA%').

The symbols parameter can be any type of calculation that contains the SYMBOL_NAME field in the resulting ticks. Every tick provides a separate symbol name in the SYMBOL_NAME field while the other fields are passed as ‘symbol parameters’.

Symbol parameters#

Symbol parameters are any fields other than SYMBOL_NAME that are constant for an instrument. We illustrate this below. First, consider a query that finds 2 most traded symbols among symbols starting with the letter ‘A’. We’ll later use the output of this query to specify symbols and symbol params.

>>> trd = otp.DataSource(db='NYSE_TAQ', tick_type='TRD', start=otp.dt(2022, 3, 1), end=otp.dt(2022, 3, 2))
>>> trd = trd.agg({'VOLUME': otp.agg.sum('SIZE')})
>>> count = otp.merge([trd], symbols=otp.Symbols(db='NYSE_TAQ', pattern='A%'), presort=True, identify_input_ts=True)
>>> most_traded = count.high('VOLUME', n=2)
>>> most_traded()
        Time     VOLUME SYMBOL_NAME TICK_TYPE
 0 2022-03-02  209053345        ATXI       TRD
 1 2022-03-02  130093671         AMD       TRD

The output of most_traded provides symbols in the SYMBOL_NAME column while the other columns (VOLUME and TICK_TYPE) provide symbol parameters. The following code retrieves the value of VOLUME for use in tick-by-tick analytics.

>>> trd = otp.DataSource('NYSE_TAQ', tick_type='TRD')
>>> trd = trd.agg({'VOLUME': otp.agg.sum('SIZE')})
>>> count = otp.merge([trd], symbols=otp.Symbols('NYSE_TAQ', pattern='A%'), identify_input_ts=True)
>>> most_traded = count.high('VOLUME', n=2)
>>>
>>> data = otp.DataSource('NYSE_TAQ', tick_type='TRD')
>>> data = data.first(3)
>>> data = data[['PRICE','SIZE']]
>>> data['VOLUME'] = data.Symbol.VOLUME
>>> dict(sorted(otp.run(data, symbols=most_traded, start=otp.dt(2022, 3, 1), end=otp.dt(2022, 3, 2)).items()))
 {'AMD':                            Time   PRICE  SIZE     VOLUME
  0 2022-03-01 04:00:00.033400871  123.60   274  130093671
  1 2022-03-01 04:00:00.057311347  123.35     3  130093671
  2 2022-03-01 04:00:00.057318115  123.35     1  130093671,
  'ATXI':                            Time   PRICE  SIZE     VOLUME
  0 2022-03-01 04:00:00.052257108  0.6302   100  209053345
  1 2022-03-01 04:00:01.140209439  0.5508  1000  209053345
  2 2022-03-01 04:00:02.144151171  0.6000   100  209053345}

Equivalently, symbol parameters can be accessed by wrapping a function around the query as illustrated below.

>>> trd = otp.DataSource('NYSE_TAQ', tick_type='TRD')
>>> trd = trd.agg({'VOLUME': otp.agg.sum('SIZE')})
>>> count = otp.merge([trd], symbols=otp.Symbols('NYSE_TAQ', pattern='A%'), identify_input_ts=True)
>>> most_traded = count.high('VOLUME', n=2)
>>>
>>> def query(sym):
...     data = otp.DataSource('NYSE_TAQ', tick_type='TRD')
...     data = data.first(3)
...     data = data[['PRICE','SIZE']]
...     data['VOLUME'] = sym['VOLUME']
...     return data
>>>
>>> dict(sorted(otp.run(query, symbols=most_traded, start=otp.dt(2022, 3, 1), end=otp.dt(2022, 3, 2)).items()))
 {'AMD':                            Time   PRICE  SIZE     VOLUME
  0 2022-03-01 04:00:00.033400871  123.60   274  130093671
  1 2022-03-01 04:00:00.057311347  123.35     3  130093671
  2 2022-03-01 04:00:00.057318115  123.35     1  130093671,
  'ATXI':                            Time   PRICE  SIZE     VOLUME
  0 2022-03-01 04:00:00.052257108  0.6302   100  209053345
  1 2022-03-01 04:00:01.140209439  0.5508  1000  209053345
  2 2022-03-01 04:00:02.144151171  0.6000   100  209053345}

Time interval per symbol#

It is allowed to specify query interval per symbol using special fields _PARAM_START_TIME_NANOS and _PARAM_START_TIME_NANOS

>>> custom_symbols = otp.Ticks(SYMBOL_NAME=['AAPL', 'MSFT'],
...                            _PARAM_START_TIME_NANOS=[otp.dt(2022, 3, 1, 10, 30), otp.dt(2022, 3, 1, 11)],
...                            _PARAM_END_TIME_NANOS=[otp.dt(2022, 3, 1, 11), otp.dt(2022, 3, 1, 11, 30)])
>>> data = otp.DataSource(db='NYSE_TAQ', tick_type='TRD')
>>> data = data.agg({'VOLUME': otp.agg.sum('SIZE')})
>>> data['SYMBOL'] = data.Symbol.name
>>> dict(sorted(otp.run(data, start=otp.dt(2022, 3, 1), end=otp.dt(2022, 3, 2), symbols=custom_symbols).items()))
 {'AAPL':                  Time  VOLUME SYMBOL
 0 2022-03-01 11:00:00   6014546   AAPL,
 'MSFT':                  Time  VOLUME SYMBOL
 0 2022-03-01 11:30:00    1503972   MSFT}

Note that per symbol intervals should be inside the query interval.

Associated symbols#

Associated symbols is a technique when unbound symbols are used to define bound symbols or symbols in related queries. It is expressed using the eval and а symbol parameter. More details can be found in the API doc for the eval