Schema
Every source has a set of fields and field types, i.e., the tick schema.
The schema of any source can be accessed and modified through onetick.py.Source.schema.
The schema for a data source looks like this:
>>> import onetick.py as otp
>>> data = otp.DataSource(db='NYSE_TAQ', tick_type='TRD', symbols='AAPL', start=otp.dt(2022, 3, 1), end=otp.dt(2022, 3, 2))
>>> data.schema
{'COND': string[4], 'CORR': <class 'onetick.py.types._int'>, 'DELETED_TIME': <class 'onetick.py.types.msectime'>, 'EXCHANGE': string[1], 'OMDSEQ': <class 'onetick.py.types.uint'>, 'PARTICIPANT_TIME': <class 'onetick.py.types.nsectime'>, 'PRICE': <class 'float'>, 'SEQ_NUM': <class 'int'>, 'SIZE': <class 'int'>, 'SOURCE': string[1], 'STOP_STOCK': string[1], 'TICKER': string[16], 'TICK_STATUS': <class 'onetick.py.types._int'>, 'TRADE_ID': string[20], 'TRF': string[1], 'TRF_TIME': <class 'onetick.py.types.nsectime'>, 'TTE': string[1]}
The schema is updated when a new column is added:
data = otp.Ticks({'X': [7, 4, 11]})
data['Y'] = data['X'] * 0.5
print(data.schema)
{'X': <class 'int'>, 'Y': <class 'float'>}
Here are some of the supported types (see the full list in Types):
int (maps to int64 in OneTick)
float (maps to double in OneTick)
onetick.py.nsectime for nanosecond-precision timestamps (nsectime in OneTick)
onetick.py.msectime for millisecond-precision timestamps (msectime in OneTick)
str for strings of no more than 64 characters (string in OneTick)
onetick.py.string for any kind of string; string[64] is equivalent to str (string[N] in OneTick)
Note
The onetick.py package doesn’t have the boolean type. Condition expressions use the float type with 1.0 and 0.0 values.
The schema makes it possible to state explicitly which fields are available for analytics before the data itself is available. This is important because the construction of the calculation graph is separate from its execution (i.e., from the moment the data becomes available). The ticks in a database could have a different set of columns or types from what the code assumes.
By default, the schema is deduced automatically for each source, but this may not always work (and, as described below, setting it explicitly is recommended anyway). For example, it is impossible to deduce a single schema for a data source when different symbols have different schemas. This leads to errors when updating a field that does not exist or adding a field that already exists (note that adding and modifying a field look the same to the user: data['X'] = 1).
Another example is when your logic is written to analyze trades, with the schema set to expect PRICE=float and SIZE=int, but quotes are passed as the source, leading to a runtime error.
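The ambiguity described above can be illustrated without OneTick at all. The following is a minimal plain-Python sketch, not onetick.py code: the symbol names, the per-symbol schemas, and the helpers classify_assignment and common_schema are all hypothetical and exist only to show why one schema cannot describe symbols whose fields differ.

```python
# Hypothetical per-symbol schemas: field name -> Python type.
# These are made-up examples, not real database contents.
schemas_by_symbol = {
    "SYM_A": {"PRICE": float, "SIZE": int},
    "SYM_B": {"PRICE": float, "SIZE": int, "X": int},
}


def classify_assignment(field, schemas):
    """Report, per symbol, whether assigning to `field` adds or updates it."""
    return {
        symbol: ("update" if field in schema else "add")
        for symbol, schema in schemas.items()
    }


def common_schema(schemas):
    """Return the fields whose name and type agree across all symbols."""
    first, *rest = schemas.values()
    return {
        name: dtype
        for name, dtype in first.items()
        if all(other.get(name) is dtype for other in rest)
    }


# An assignment such as data['X'] = 1 is an "add" for one symbol and an
# "update" for the other, so no single deduced schema fits both.
print(classify_assignment("X", schemas_by_symbol))
print(common_schema(schemas_by_symbol))
```

Only the fields shared by every symbol can be relied on, which is one reason to state the expected schema explicitly rather than depend on deduction.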
Schema deduction mechanism
onetick.py.DataSource is the main interface for reading from the OneTick database.
This is one of the few places where onetick-py can’t easily get the schema of the data, so this class has a schema guessing / deduction mechanism based on the passed db, tick_type, symbol(-s), start and end parameters.
This is convenient and fits well with Jupyter-style code writing and with tests. However, the mechanism does not guarantee a correct schema in all cases, and it requires an additional query to get the schema, which slows down query generation.
That is why we strongly recommend explicitly disabling this mechanism and specifying the schema manually in production code.
The schema deduction takes place in the constructor of onetick.py.DataSource, and it is enabled by default when the db parameter is specified.
It is possible to control this behaviour with the schema_policy parameter.
data = otp.DataSource(db='NYSE_TAQ',
                      tick_type='QTE',
                      symbol='AAPL',
                      schema_policy='manual')
data.schema.set(ASK_PRICE=float, BID_PRICE=float, ASK_SIZE=int, BID_SIZE=int)
schema_policy='manual' means that the source object starts with an empty schema, and the user is expected to set the schema manually using the onetick.py.Source.schema methods.
This is the recommended approach for production code.
Fields not in the schema
A source may have more fields than the use case needs. Defining only the necessary fields in the schema does not actually remove the others: they are still propagated. However, the onetick.py.Source.table() method can be used to propagate only the specified fields and to set the schema accordingly.
data = data.table(ASK_PRICE=float, BID_PRICE=float, ASK_SIZE=int, BID_SIZE=int)
onetick.py.Source.table() guarantees that the fields will be present at runtime even if they are not present in the data. In this case, a field is filled with the default value for the corresponding field type.
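The behaviour described above (drop fields that are not listed, fill listed fields that are missing) can be modelled on a single tick in plain Python. This is only a sketch of the idea, not how onetick.py implements table(): the helper project_tick and the sample tick are hypothetical, and the fill values are Python's zero values, which stand in for whatever defaults OneTick actually uses for each type.

```python
def project_tick(tick, schema):
    """Keep only the fields named in `schema`; fill missing ones with a default.

    Assumption: the default is the Python zero value of the type
    (int() -> 0, float() -> 0.0, str() -> ''). The defaults OneTick
    actually uses may differ.
    """
    return {name: tick.get(name, dtype()) for name, dtype in schema.items()}


schema = {"ASK_PRICE": float, "BID_PRICE": float, "ASK_SIZE": int, "BID_SIZE": int}

# A made-up tick with one extra field (EXCHANGE) and one missing field (BID_SIZE).
tick = {"ASK_PRICE": 101.5, "BID_PRICE": 101.25, "ASK_SIZE": 300, "EXCHANGE": "N"}

projected = project_tick(tick, schema)
print(projected)
```

In this sketch the extra EXCHANGE field is dropped and the missing BID_SIZE field appears with the stand-in default, so downstream logic can rely on exactly the fields listed in the schema.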
Types change
The field type can be modified. This is done implicitly when values of a different type are assigned to the field:
data['X'] = 1 # it is the `int` type
data['X'] = data['X'] / 2 # here it becomes `float`
or it can be done explicitly using the onetick.py.Source.apply() method (or, equivalently, onetick.py.Source.astype()):
data['X'] = data['X'].apply(str)
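The same sequence of type changes can be followed with ordinary Python values. The sketch below is a plain-Python analogy rather than onetick.py code: the schema dictionary is a hypothetical stand-in that records the type of a value named X after each step, mirroring the implicit change on division and the explicit conversion to a string.

```python
schema = {}      # hypothetical stand-in: field name -> current Python type
history = []     # type names recorded after each step

x = 1
schema["X"] = type(x)                  # int
history.append(schema["X"].__name__)

x = x / 2                              # true division yields a float in Python
schema["X"] = type(x)                  # float (implicit change)
history.append(schema["X"].__name__)

x = str(x)                             # explicit conversion, analogous to apply(str)
schema["X"] = type(x)                  # str (explicit change)
history.append(schema["X"].__name__)

print(history)
```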