
Synthetic data generation - Part 1

Introduction

It’s not uncommon to be asked to test the performance of a system against realistic future volumes. That naturally leads to the question of how to generate the test data. Let’s look at some solutions.


Let’s assume an FX futures broker wants to start a new business, and these are the volumes they expect to generate in one year and in five years. Since it’s a new business, there are no historical transactions, so we’ll have to generate the test data from scratch.



As you can see, this is just for one currency for now. We’re omitting factors such as the type of client, which may play a role in a real-life scenario.


If we look at the data differently, we can see the split of transaction volumes against the total and within each maturity. In our one-year case above, the proportions of transactions are the same within each maturity, which simplifies the solution to start with:



For this paper we found a few open-source solutions. A list of various tools for generating synthetic data is maintained here: https://github.com/statice/awesome-synthetic-data


Most of these are frameworks which learn from and project existing data using various machine-learning techniques – we will cover this approach in a future post. In our case we’re looking at a Day Zero problem – i.e. we don’t have any data to learn from. One of the most popular libraries for generating synthetic data, SDV, has a Day Zero solution, but it’s a commercial one, so not quite free: https://docs.sdv.dev/sdv/single-table-data/modeling/synthesizers/dayzsynthesizer


There are a few solutions which will allow us to create Day Zero data.


 

The plaitpy Python library


Plaitpy is a Python library which lets us describe the data in a YAML format. It has some great features which allow us to generate various types of distributions, pick items from predefined CSVs, etc. The documentation and examples are in the git repo – especially https://github.com/test-performance/plaitpy/blob/master/README.md and https://github.com/test-performance/plaitpy/tree/master/templates.

For example, to describe our first scenario with the equal distribution, the YAML is just:



https://github.com/test-performance/plaitpy/blob/master/templates/finance/transaction_equal_distro.yaml

fields:
  currency:
    value: EUR/USD
  maturity:
    mixture:
      - value: 1Y
        weight: 28
      - value: 5Y
        weight: 14
      - value: 6M
        weight: 58
  _base:
    mixture:
      - value: 200000
        weight: 77
      - value: 1000000
        weight: 19
      - value: 10000000
        weight: 4
  amount:
    lambda: random.gauss(this._base * 1.1, this._base * 0.1)
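As a rough cross-check, the same mixture-and-lambda logic can be sketched in plain Python using only the standard library. This is not plaitpy itself – `gen_record`, `MATURITIES`, and `BASES` are our own names for this sketch – but it mirrors what the template above expresses:

```python
import random

# Weighted choices mirroring the YAML mixtures above
MATURITIES = (["1Y", "5Y", "6M"], [28, 14, 58])
BASES = ([200_000, 1_000_000, 10_000_000], [77, 19, 4])

def gen_record():
    """Generate one synthetic transaction, mirroring the YAML template."""
    maturity = random.choices(*MATURITIES)[0]
    base = random.choices(*BASES)[0]
    # amount ~ gauss(mean=base * 1.1, stddev=base * 0.1), as in the lambda field
    amount = random.gauss(base * 1.1, base * 0.1)
    return {"currency": "EUR/USD", "maturity": maturity, "amount": amount}

random.seed(42)  # seed for reproducible runs
sample = [gen_record() for _ in range(5)]
```

Generating a large batch and counting the maturities should reproduce the 28/14/58 split to within sampling noise.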

 

The above takes advantage of the lambda function to give the amounts a Gaussian distribution. After generating the data, we can compare the resulting distribution with the expected one:



This is a pretty close fit, so let’s make the challenge a bit harder. We now want to generate data for the volumes in five years’ time – you can see them in the last column of the first table.


If we look at the breakdown, we can see that the proportion of volumes within each category differs across categories:



This is how we can describe the data in plaitpy’s YAML format:


https://github.com/test-performance/plaitpy/blob/master/templates/finance/transaction.yaml

fields:
  currency:
    value: EUR/USD
  maturity:
    mixture:
      - value: 1Y
        weight: 23
      - value: 5Y
        weight: 11
      - value: 6M
        weight: 66
  _base6M:
    mixture:
      - value: 200000
        weight: 96
      - value: 1000000
        weight: 3
      - value: 10000000
        weight: 1
  _base1Y:
    mixture:
      - value: 200000
        weight: 95
      - value: 1000000
        weight: 4
      - value: 10000000
        weight: 1
  _base5Y:
    mixture:
      - value: 200000
        weight: 48
      - value: 1000000
        weight: 38
      - value: 10000000
        weight: 14
  _base:
    switch:
      - onlyif: this.maturity == "6M"
        lambda: this._base6M
      - onlyif: this.maturity == "1Y"
        lambda: this._base1Y
      - onlyif: this.maturity == "5Y"
        lambda: this._base5Y
      - default:
        value: 0
  amount:
    lambda: random.gauss(this._base * 1.1, this._base * 0.1)
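The switch construct can likewise be sketched in plain Python: first draw the maturity, then pick the amount mixture that matches it. Again, this is our own illustrative code, not plaitpy internals – `gen_record`, `MATURITIES`, and `BASE_MIXTURES` are names invented for the sketch:

```python
import random

MATURITIES = (["1Y", "5Y", "6M"], [23, 11, 66])

# Per-maturity amount mixtures, mirroring _base6M / _base1Y / _base5Y above
BASE_MIXTURES = {
    "6M": ([200_000, 1_000_000, 10_000_000], [96, 3, 1]),
    "1Y": ([200_000, 1_000_000, 10_000_000], [95, 4, 1]),
    "5Y": ([200_000, 1_000_000, 10_000_000], [48, 38, 14]),
}

def gen_record():
    """Generate one transaction; the base mixture depends on the maturity."""
    maturity = random.choices(*MATURITIES)[0]
    # the 'switch': select the base mixture that matches the maturity
    values, weights = BASE_MIXTURES[maturity]
    base = random.choices(values, weights)[0]
    amount = random.gauss(base * 1.1, base * 0.1)
    return {"currency": "EUR/USD", "maturity": maturity, "amount": amount}
```

Because the 5Y mixture puts far more weight on the large notionals, the mean amount of generated 5Y trades should come out well above the 6M mean.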

You will notice that the distribution is now split between the maturity mixture (where the 1Y category has a 23% weight) and the per-maturity amount mixtures (where _base1Y gives the smallest amount a 95% weight). This YAML is a bit more complicated, but still quite easy to understand: the _base field acts as a sub-routine which, based on the maturity field generated earlier, in turn calls one of the _base6M/_base1Y/_base5Y sub-routines.


Once we run this configuration and compare the output with the expected results, we see:



This is again a pretty close fit. What if there are more fields we need to generate? It depends on how they relate to the other fields:


a) If they don’t depend on another field, they’re easy to add – either as static values, or as uniform, Gaussian, or other distributions using the lambda functions.

b) If they depend on one or more fields – meaning they are calculated fields – this can easily be done via a lambda function.

c) If they sit at the top of the tree – like the currency in our case – it’s easy to create a new YAML file for each of them, plus a bit of Python which calls the templates in sequence and generates all the data across these templates. Let us know and we can share a version of this with you.

d) If the field has a different distribution from the other fields – imagine, for example, a client-type field where most of the 10-million trades are made with a big client but a proportion are made with small ones – this can in most cases be done using a combination of the sub-routines and lambda functions.
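Case d) can be illustrated with a small sketch. This is a hypothetical example, not taken from the plaitpy templates: `client_type` is a made-up helper whose mixture weights depend on the trade amount, the same way a sub-routine plus a switch would express it in YAML.

```python
import random

def client_type(amount):
    """Hypothetical dependent field: large trades skew towards big clients."""
    if amount >= 5_000_000:
        weights = [90, 10]   # large trades: mostly big clients
    else:
        weights = [20, 80]   # smaller trades: mostly small clients
    return random.choices(["big", "small"], weights)[0]

# e.g. attach it to an existing record:
record = {"amount": 10_000_000}
record["client_type"] = client_type(record["amount"])
```

In plaitpy this would become a hidden field per amount bracket and a switch on the amount, exactly like the _base field above switches on the maturity.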


Conclusion on Plaitpy

Pros:

  • Version control of the YAML files

  • Easily reproducible

  • Can be made part of CI/CD cycle


Cons:

  • It’s not very visual, and the YAML takes some getting used to

  • The yaml structure is not self-explanatory


We’ll follow up with Part 2, which will discuss an alternative tool: Synner.



