Synthetic data generation - Part 2

info8191790
Mar 11, 2024

In our last post we covered synthetic data creation using Plaitpy - a small Python library. This time we'll cover another open-source tool we found to be very useful - Synner.

First, a quick refresher of the problem. We assume that the client is an FX futures broker who wants to start a new business, and these are the volumes they expect to generate in one year and in five years. Since it’s a new business, it doesn’t have any transactions yet, so we’ll have to generate the test data from scratch:

As you can see, this is just for one currency for now. We’re omitting factors such as the type of client, which may matter in a real-life scenario.

If we look at the data differently, we can see the split of transaction volumes against the total and against the maturities. In our 1 year case above, the proportions of transactions within the maturities are the same, which simplifies the solution to start with:

Synner

This is a great tool with a nice UI. You will need a bit of technical skill to build it. We got in touch with the author - Miro Mannino (https://www.linkedin.com/in/miromannino/), who was super helpful with building it. Thanks!

If you need any help with building it, please let us know at info@testperformance.co.uk. Once it’s running, you can define the distribution of the amounts across the various amount brackets mentioned at the beginning – see the _above column defined below:

Once we have that, we can define the actual distribution. We’ll choose a Gaussian distribution conditional on the _above variable we’ve just defined:
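For readers who want to see the idea outside the UI, here is a minimal Python sketch of the same two-step logic: first pick a bracket for the _above column, then draw the amount from a Gaussian whose parameters depend on that bracket. This is not how Synner implements it internally, and the bracket names, weights, means and standard deviations are all made-up assumptions for illustration, not the figures from our scenario.

```python
import random

# Hypothetical brackets and weights, for illustration only.
BRACKET_WEIGHTS = {"above_10k": 0.6, "above_100k": 0.3, "above_1m": 0.1}

# Hypothetical Gaussian parameters (mean, standard deviation) per bracket.
AMOUNT_PARAMS = {
    "above_10k": (50_000, 15_000),
    "above_100k": (500_000, 150_000),
    "above_1m": (5_000_000, 1_500_000),
}


def generate_rows(n, seed=0):
    """Generate n rows: pick a bracket, then draw a Gaussian amount for it."""
    rng = random.Random(seed)  # seeded so the output is reproducible
    brackets = list(BRACKET_WEIGHTS)
    weights = [BRACKET_WEIGHTS[b] for b in brackets]
    rows = []
    for _ in range(n):
        # Step 1: choose the bracket according to the desired split.
        above = rng.choices(brackets, weights=weights, k=1)[0]
        # Step 2: draw the amount conditional on the chosen bracket.
        mean, std = AMOUNT_PARAMS[above]
        amount = max(0.0, rng.gauss(mean, std))  # clamp to avoid negative amounts
        rows.append({"_above": above, "amount": amount})
    return rows


rows = generate_rows(10_000)
```

Because the bracket is chosen first, the split across brackets follows the weights directly, while the Gaussian only controls how amounts spread within each bracket.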

This is a nice basic scenario, and once we generate the data we get a very close match with the desired distribution. (NB: Don’t be confused by the 100,000 instead of 200,000 – we’ve rounded the amounts to orders of magnitude in the results data in Excel to make it simpler to bracket the data.)
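We did the bracketing in Excel, but the same check can be sketched in Python. The snippet below rounds each amount down to its order of magnitude (so 200,000 lands in the 100,000 bracket) and reports the share of rows per bracket. The function names and sample amounts are our own illustrative assumptions, not real output.

```python
import math
from collections import Counter


def magnitude_bracket(amount):
    """Round a positive amount down to its order of magnitude."""
    return 10 ** math.floor(math.log10(amount))


def bracket_shares(amounts):
    """Return the share of amounts falling into each order-of-magnitude bracket."""
    counts = Counter(magnitude_bracket(a) for a in amounts)
    total = sum(counts.values())
    return {bracket: count / total for bracket, count in sorted(counts.items())}


# Made-up amounts, for illustration only.
sample_amounts = [15_000, 42_000, 200_000, 350_000, 1_200_000]
shares = bracket_shares(sample_amounts)
```

Comparing these shares with the target split is a quick way to see how close the generated data is to the desired distribution.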

We can now make it more complex by adding more dependencies, e.g. making the _above column dependent on the maturity, so that if Maturity = 1Y the split for the _above column is different from that for 5Y and 6M.
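As a rough illustration of that dependency, the Python sketch below picks the maturity first and then uses a different _above split for each maturity. Again, this is only a sketch of the idea rather than Synner's own mechanism, and every maturity weight and bracket split in it is a hypothetical value chosen for the example.

```python
import random
from collections import Counter

# Hypothetical share of transactions per maturity.
MATURITY_WEIGHTS = {"6M": 0.5, "1Y": 0.3, "5Y": 0.2}

# Hypothetical _above split for each maturity (each row sums to 1).
ABOVE_GIVEN_MATURITY = {
    "6M": {"above_10k": 0.70, "above_100k": 0.25, "above_1m": 0.05},
    "1Y": {"above_10k": 0.60, "above_100k": 0.30, "above_1m": 0.10},
    "5Y": {"above_10k": 0.30, "above_100k": 0.40, "above_1m": 0.30},
}


def sample_row(rng):
    """Pick a maturity, then pick the _above bracket using that maturity's split."""
    maturity = rng.choices(
        list(MATURITY_WEIGHTS), weights=list(MATURITY_WEIGHTS.values()), k=1
    )[0]
    split = ABOVE_GIVEN_MATURITY[maturity]
    above = rng.choices(list(split), weights=list(split.values()), k=1)[0]
    return maturity, above


rng = random.Random(0)  # seeded so the output is reproducible
pairs = [sample_row(rng) for _ in range(20_000)]
counts = Counter(pairs)  # number of rows per (maturity, _above) combination
```

Dividing each count by the total for its maturity gives the realised split per maturity, which can then be checked against the table of target proportions.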

In Synner, you can either generate the data directly from the user interface or download the model (in JSON format) and run it from the command line. The model file is a bit long to show here, so here is an example under this link: https://github.com/test-performance/synner/tree/master/docs

In this case, for our overall distribution in five years, the data from Synner was also a very close match to our desired distribution: