9
$\begingroup$

If I have data for different companies' revenue, and I want to train a GNN model to work for all companies, how should I split the data?

Imagine, for example, company 1 has 1000 data points over time, company 2 has 2000 data points over time, company 3 has only 100 data points, and so on.

What are good ways to split the data for training a GNN model?

$\endgroup$

1 Answer

10
$\begingroup$

There are a few ways to handle this, and which one fits depends on the assumptions you make about your data.

Treat all data equally

If you view all data as equally relevant and representing a similar process, you could just lump all data into one dataset and perform your splits from there.
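As a minimal sketch of the pooled approach, assuming each data point is stored as a hypothetical `(company_id, index)` pair and using the dataset sizes from the question, all records are shuffled together and a fixed fraction is held out. The function name `pooled_split` and the 20% test fraction are illustrative choices, not something prescribed above.

```python
import random


def pooled_split(records, test_fraction=0.2, seed=0):
    """Shuffle all records together and hold out a fraction for testing."""
    rng = random.Random(seed)
    shuffled = list(records)  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical data: one (company_id, index) pair per data point.
sizes = {"company_1": 1000, "company_2": 2000, "company_3": 100}
records = [(company, i) for company, n in sizes.items() for i in range(n)]

train, test = pooled_split(records)
print(len(train), len(test))
```

With this split, the number of test points per company is left to chance, so a small company may end up with very few of them.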

Handle the imbalance

If you want to make sure that all companies are proportionally represented, perform a train/test split separately for each company, then combine the per-company train subsets into one training set and the per-company test subsets into one test set. You could view this as a sort of stratified split, with the company as the stratum.
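The per-company split could be sketched as follows, again assuming hypothetical `(company_id, index)` records and an illustrative 20% test fraction. Each company's points are shuffled and split on their own, so every company contributes the same fraction to the test set.

```python
import random
from collections import Counter, defaultdict


def per_company_split(records, test_fraction=0.2, seed=0):
    """Split each company's records separately, then combine the parts."""
    rng = random.Random(seed)
    by_company = defaultdict(list)
    for company, item in records:
        by_company[company].append((company, item))

    train, test = [], []
    for company in sorted(by_company):  # fixed order keeps the split reproducible
        group = by_company[company]
        rng.shuffle(group)
        n_test = int(round(len(group) * test_fraction))
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Hypothetical data: one (company_id, index) pair per data point.
sizes = {"company_1": 1000, "company_2": 2000, "company_3": 100}
records = [(company, i) for company, n in sizes.items() for i in range(n)]

train, test = per_company_split(records)
test_counts = Counter(company for company, _ in test)
print(test_counts)
```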

Upsample the less represented data

This has rarely worked well for me in practice, but you can also upsample the less represented data. For instance, sample each data point with a probability inversely proportional to the number of data points in its company's subset. A data point from the company with 100 data points would then be 10 times as likely to be sampled as one from the company with 1000 data points.
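A minimal sketch of inverse-frequency sampling, using only the standard library and the dataset sizes from the question. The helper name `inverse_frequency_weights` and the sample size of 3000 are assumptions for illustration. Each point gets weight 1 / (size of its company's subset), so every company carries the same total weight.

```python
import random
from collections import Counter


def inverse_frequency_weights(companies):
    """Weight each data point by 1 / (number of points from its company)."""
    counts = Counter(companies)
    return [1.0 / counts[company] for company in companies]


# Hypothetical data: the company label of every data point.
sizes = {"company_1": 1000, "company_2": 2000, "company_3": 100}
companies = [company for company, n in sizes.items() for _ in range(n)]

weights = inverse_frequency_weights(companies)

# Draw indices with replacement according to the weights.
rng = random.Random(0)
sampled_idx = rng.choices(range(len(companies)), weights=weights, k=3000)
sampled_counts = Counter(companies[i] for i in sampled_idx)
print(sampled_counts)
```

Because sampling is with replacement, points from the smallest company are repeated many times, which may lead to overfitting on them. If you train with PyTorch, `torch.utils.data.WeightedRandomSampler` accepts per-sample weights of this kind.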

$\endgroup$
