Particle filters for limit order books: my Cambridge Master's thesis

My final-year Master's project, for my degree in Information and Computer Engineering at Cambridge University, focused on discovering new and improved sequential Monte Carlo filters for use in high-frequency trading.

Full text: PDF     Code: GitHub

Title: Simulation and inference for limit order books in high-frequency finance

Abstract

This project aims to modify and improve filtering and prediction methods used for analysing time-series financial data, in order to achieve superior filtering performance compared with the previous state of the art. Improvements to these methods are beneficial not only to trading algorithms but across engineering, in problems as wide as robotics and forecasting. The focus here, however, is on the limit order book: the flow of buy and sell orders that drives the market.

A stochastic model for generating limit order book data is derived, where trade data is treated as a hidden Markov model with additional nonlinear Poisson-distributed jumps present. The Kalman and particle filters are extended to utilise additional information from the limit order book, in order to most accurately detect these jumps and the price momentum. Modifications to the original particle filter, and to previous work incorporating order volume data, include: an additional trend term in the state equation; resetting the trend means and covariances on jumps; exploring weighted functions of order volumes; and combining these methods into an overall best model.

This model is implemented, and tests are conducted on low-frequency and high-frequency trade data to assess both empirical and statistical metrics. These include mean square error, binary prediction accuracy, log-likelihood and inverse CDF sampling over multiple time frames, to best approximate performance in a real trading system. Consideration is also given to the challenges faced in applying these methods to very high-frequency data, and how they may be addressed. Methods are explored for optimising hyperparameters against changing real-time data, and for implementing these algorithms in an efficient and scalable way conducive to real-world use.

Performance improvements are demonstrated with the new combined model over previous models, and its versatility is shown by its ability to track jumps in high-frequency data which exhibit complex nonlinear dynamics. These results are promising for future applications of these methods in work on limit order book inference and more generally in time series analysis.
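
For readers who have not met particle filters before, here is a minimal Python sketch of the general idea described in the abstract: a hidden price with a trend term, occasional jumps that reset the trend, and a bootstrap particle filter that tracks the price from noisy observations. It is a toy illustration and not the filter from the thesis. It ignores order volume data entirely, approximates Poisson jump arrivals with a fixed per-step jump probability, and every function name and parameter value in it is an assumption chosen for the example.

```python
import math
import random

# All values below are illustrative assumptions, not settings from the thesis.
START_PRICE = 100.0  # known starting price shared by simulator and filter
JUMP_PROB = 0.05     # per-step jump probability (stand-in for Poisson arrivals)
SIGMA_TREND = 0.02   # std of the random-walk noise on the trend
SIGMA_PRICE = 0.05   # std of the diffusion noise on the price
SIGMA_JUMP = 0.5     # std of the jump size
SIGMA_RESET = 0.1    # std of the trend redrawn after a jump
SIGMA_OBS = 0.2      # std of the observation noise


def step(price, trend, rng):
    """Advance one (price, trend) state by a single time step."""
    if rng.random() < JUMP_PROB:
        # Jump: shift the price and redraw the trend around zero.
        price += rng.gauss(0.0, SIGMA_JUMP)
        trend = rng.gauss(0.0, SIGMA_RESET)
    else:
        # No jump: the trend follows a slow random walk.
        trend += rng.gauss(0.0, SIGMA_TREND)
    price += trend + rng.gauss(0.0, SIGMA_PRICE)
    return price, trend


def simulate(n_steps, rng):
    """Generate hidden prices and noisy observations from the toy model."""
    price, trend = START_PRICE, 0.0
    hidden, observed = [], []
    for _ in range(n_steps):
        price, trend = step(price, trend, rng)
        hidden.append(price)
        observed.append(price + rng.gauss(0.0, SIGMA_OBS))
    return hidden, observed


def bootstrap_filter(observed, n_particles, rng):
    """Return the weighted-mean price estimate at each time step."""
    particles = [(START_PRICE, 0.0)] * n_particles
    estimates = []
    for y in observed:
        # 1. Propagate every particle through the same dynamics as the model.
        particles = [step(p, v, rng) for p, v in particles]
        # 2. Weight by the Gaussian observation likelihood, in the log domain.
        log_w = [-0.5 * ((y - p) / SIGMA_OBS) ** 2 for p, _ in particles]
        max_lw = max(log_w)
        weights = [math.exp(lw - max_lw) for lw in log_w]
        total = sum(weights)
        weights = [w / total for w in weights]
        # 3. Estimate the price as the weighted mean of the particles.
        estimates.append(sum(w * p for w, (p, _) in zip(weights, particles)))
        # 4. Multinomial resampling to concentrate on likely particles.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return estimates


def mse(a, b):
    """Mean square error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
hidden, observed = simulate(300, rng)
filtered = bootstrap_filter(observed, 1000, rng)
print(f"raw observation MSE: {mse(observed, hidden):.4f}")
print(f"filtered MSE:        {mse(filtered, hidden):.4f}")
```

The script prints the mean square error of the raw observations and of the filtered estimate against the hidden price, so the two can be compared directly. Because each particle carries its own trend and can jump independently, the particle cloud is able to follow sudden moves that a single Gaussian estimate would smooth over, which is the basic motivation for using particle filters on this kind of data.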

Author notes

This thesis (technically called a project final report) is the culmination of my final year's work at Cambridge University Engineering Department, completed with the help of the Signal Processing and Communications Laboratory. While my degree makes me an engineer by training, and I spent the first two years studying everything from structural mechanics to thermofluids, my final two years focussed on signals and systems, statistical machine learning and general computer science concepts. As a result, I wanted to apply skills from these areas to my final project, and the area of high-frequency finance is ripe for the application of such scientific computing methods.

I'm quite pleased with the results we found. Not only did the project teach me a huge amount about working with Python in a more complex environment than just simple scripts, it also solidified my respect for classical statistics as the bedrock of almost every modern inference task. This is important to remember, as quantitative finance is an area where stochastic calculus and Bayesian filtering are still in heavy use, and probably will be for some time, despite the pace of development in deep learning.

What the code does

Included in the published code is a module called model which contains the complete particle filter. Import it with import model.filter_wrapper as fltr. By default the hyper-parameters are initialised so that the code runs but does not produce usable results; they must be tuned to suit the data being used. Initialise a particle filter with filter = fltr.FilterWrapper(filter_type, params), where filter_type is a string defining the sort of filter to be used, e.g. 'extended-volume-reset', and params is a dictionary containing any modified hyper-parameters.

The code requires data to be in the form of a gzipped Pandas DataFrame. To run the filter with a certain data file, use filter.run(file, 'snp'), where file is a string with the location of the data and 'snp' defines the data as being of the snapshot type. Unfortunately, I can't share the data I used to generate the results in the report, but included with the code is a DataFrame, sample.pd.gz, which shows the column headings and data types needed, with a single sample row of data (note that more rows of data must be used for the filter to run properly). Once the filter has completed evaluation, some basic statistics can be returned with stats = filter.stats().
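
Put together, the calls described above might be wrapped up as in the sketch below. The helper name run_filter and the file name in the usage comment are hypothetical, and the sketch assumes that passing an empty dictionary for params leaves every hyper-parameter at its default. It also uses pf rather than filter as the variable name, simply to avoid shadowing Python's built-in filter function.

```python
def run_filter(data_path, filter_type="extended-volume-reset", params=None):
    """Run one particle filter over a data file and return its basic statistics.

    Hypothetical convenience wrapper around the calls described in this post.
    """
    # Imported inside the function so this file loads even when the
    # thesis repository is not on the Python path.
    import model.filter_wrapper as fltr

    # Assumption: an empty dict means "no modified hyper-parameters",
    # i.e. the untuned defaults that run but do not give usable results.
    pf = fltr.FilterWrapper(filter_type, params or {})

    # 'snp' declares the file to be snapshot-type data.
    pf.run(data_path, "snp")
    return pf.stats()


# Example usage (hypothetical file name; needs many rows of real data):
# stats = run_filter("my_order_book_data.pd.gz")
```

In practice the params dictionary would carry the hyper-parameters tuned for the data set in question; the available keys are documented in model/filter_wrapper.py rather than guessed at here.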

There is much more to the code, including more functions, and options for those described above, all of which should be documented in model/filter_wrapper.py.

A few pointers on formatting and working with LaTeX

I spent probably more time than was sensible perfecting the formatting of my thesis. Here are a few notes and tips for working with LaTeX in an academic setting, which may or may not save others some time (and hair-pulling) in future:

  • TeXstudio is my IDE of choice; I've been using it for several years and have yet to find anything that feels more complete or intuitive, with a helpful editor and preview in a single window.
  • TeXcount was particularly useful for tallying up a reasonably accurate word count over multiple .tex files that are included in a master file (definitely do this, it makes organising large documents so much easier!). A quick total word count for the entire thesis can be found with: texcount thesis.tex -sum -inc.
  • Always be sure to \usepackage{microtype}. Its character protrusion and font expansion (where the TeX engine supports them) give more even line breaks with fewer hyphenated words, which also tends to make better use of space on the page.
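  • Note the difference between TeX's various methods for line-spacing, as it can lead to serious confusion when you wonder why your '1.5x' spacing is not actually that. For example, \onehalfspacing from the setspace package stretches the baseline skip by a factor of about 1.25 rather than 1.5, whereas \setstretch{1.5} or \linespread{1.5} applies the literal factor.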
  • The cleveref package makes it much easier to be consistent in your use of Fig., Figure and figure, by eliminating the need to write them out explicitly every time you reference something.
  • Use git with your LaTeX! It's code like any other, and putting it in a repository allows you to track changes over time, roll back in case of accidental deletion, and save snapshots for draft versions. Consider pushing your repo to a free, private remote like Bitbucket or GitLab as a simple backup. Alternatively, the online editor Overleaf, which many of my colleagues used to good effect, uses git behind the scenes for its versioning.