Schema#
Every source comes with fields and field types, i.e., the tick schema.
A developer could access and modify the schema using onetick.py.Source.schema of any source.
The schema for a data source looks like this:
>>> data = otp.DataSource(db='NYSE_TAQ', tick_type='TRD', symbols='AAPL', start=otp.dt(2022, 3, 1), end=otp.dt(2022, 3, 2))
>>> data.schema
 {'COND': string[4], 'CORR': <class 'int'>, 'DELETED_TIME': <class 'onetick.py.types.msectime'>, 'EXCHANGE': string[1], 'OMDSEQ': <class 'onetick.py.types.uint'>, 'PARTICIPANT_TIME': <class 'onetick.py.types.nsectime'>, 'PRICE': <class 'float'>, 'SEQ_NUM': <class 'int'>, 'SIZE': <class 'int'>, 'SOURCE': string[1], 'STOP_STOCK': string[1], 'TICKER': string[16], 'TICK_STATUS': <class 'onetick.py.types.short'>, 'TRADE_ID': string[20], 'TRF': string[1], 'TRF_TIME': <class 'onetick.py.types.nsectime'>, 'TTE': string[1]}
The schema is updated when a new column is added:
data = otp.Ticks({'X': [7, 4, 11] })
data['Y'] = data['X'] * 0.5
print(data.schema)
{'X': <class 'int'>, 'Y': <class 'float'>}
The following types are supported:
- int(maps to the- int64in OneTick)
- float(maps to the- doublein OneTick)
- onetick.py.nsectimefor nanosecond precision timestamps (the- nsectimein OneTick)
- onetick.py.msectimefor milliseconds precision timestamp (the- msectimein OneTick)
- strfor string not more than 64 characters (the- stringin OneTick)
- onetick.py.stringfor any kind of strings,- string[64]is equal to the- str(the- string[N]in OneTick)
Note
The onetick.py package doesn’t have the boolean type. Condition expressions use the float type with 1.0 and 0.0 values.
The schema allows being explicit about the fields available for analytics when the data is not available. This is important as the construction of the calculation graph is separate from the execution (i.e., from when the data becomes available). The ticks in a database could have different set of columns or types from what the code assumes.
By default the schema is deduced automatically for each source but this may not always work (and as you’ll see below we’ll recommend setting it explicitly anyway). For example, it is impossible to deduce a single schema for a data source when different symbols have different schemas which will lead to errors when updating a field that does not exist or adding a field that exists (note that the operations to add and modify fields look the same to the user data[‘X’] = 1)
Another example is when your logic aims to analyze trades and you set the schema to expect PRICE=float and SIZE=int but quotes are passed as a source leading to a runtime error.
Schema deduction mechanism#
The constructor of the onetick.py.DataSource class has the schema guessing / deduction mechanism
based on the passed db, tick_type, symbol(-s), start and end parameters.
It is convenient and fits well with the Jupyter style of code writing and with tests,
however it’s impossible to find the right schema in general. That is why
we strongly recommend to explicitly disable this mechanism and specify the schema manually for production cases.
The schema deduction takes place in the constructor of the onetick.py.DataSource,
and it is enabled by default when the db parameter is specified. There is ability to control this behaviour with the schema_policy parameter.
data = otp.DataSource(db='NYSE_TAQ',
                      tick_type='TRD',
                      symbol='AAPL',
                      schema_policy='manual')
data.schema.set(ASK_PRICE=float, BID_PRICE=float, ASK_SIZE=int, BID_SIZE=int)
manual means that the source object has an empty schema and it is expected that the schema will be set manually using the otp.Source.schema.set method. That is the recommended way for production code.
Fields not in the schema#
It is possible that the source has more fields than needed for the use case. Defining only the necessary fields in the
schema does not actually remove them: they are still propagated. However, the onetick.py.Source.table()
method can be used to propagate only the specified fields and to set the schema accordingly.
data = data.table(ASK_PRICE=float, BID_PRICE=float, ASK_SIZE=int, BID_SIZE=int)
onetick.py.Source.table() guarantees the fields will be present during runtime even if
the fields are not present in the data. In this case, a field is filled with the default value for the corresponding
field type.
Types change#
The field type can be modified. This is done implicitly when values of a different type are assigned to the field
data['X'] = 1              # it is the `int` type
data['X'] = data['X'] / 2  # here it becomes `float`
or it could be done explicitly using the onetick.py.Source.apply() method
(or equivalently – onetick.py.Source.astype())
data['X'] = data['X'].apply(str)