Split data in different units for GNN

Question

If I have data for different companies' revenue, and I want to train a GNN model to work for all companies, how should I split the data?

Imagine, for example, company 1 has 1000 data points over time, company 2 has 2000 data points over time, company 3 has only 100 data points, and so on.

What are the ways to split the data for training a GNN model?

Valentin Calomme · Accepted Answer · 2026-04-23 08:54:40Z

There are a few ways to handle this, and which one fits depends on the assumptions you make about your data.

Treat all data equally

If you view all data as equally relevant and representing a similar process, you could just lump all data into one dataset and perform your splits from there.
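As a minimal sketch of the pooled approach, assuming each observation can be identified by a (company, index) pair: the company names, the 80/20 ratio, and the helper `pooled_split` are hypothetical choices for illustration, and the sizes are the ones from the question.

```python
import random

# Hypothetical toy data: company id -> number of observations (sizes from the question).
sizes = {"company_1": 1000, "company_2": 2000, "company_3": 100}

# One pooled list of (company, observation index) pairs.
pooled = [(company, i) for company, n in sizes.items() for i in range(n)]


def pooled_split(rows, test_fraction=0.2, seed=0):
    """Shuffle all rows together and cut once, ignoring company membership."""
    rng = random.Random(seed)
    shuffled = rows[:]          # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


train, test = pooled_split(pooled)
print(len(train), len(test))
```

With a single pooled cut, each company's share of the test set is only approximately proportional to its size, so a small company may end up with very few test points.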

Handle the imbalance

If you want to make sure that all companies are proportionally represented, perform a train/test split for each company separately and then combine the per-company train and test subsets. You could view this as a sort of stratified split.
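The per-company split could be sketched as follows. This is an illustration under assumptions, not a definitive implementation: the helper `per_company_split`, the company names, and the 80/20 ratio are hypothetical, and observations are again represented as (company, index) pairs.

```python
import random

# Hypothetical toy data: company id -> number of observations (sizes from the question).
sizes = {"company_1": 1000, "company_2": 2000, "company_3": 100}


def per_company_split(sizes, test_fraction=0.2, seed=0):
    """Split each company's observations separately, then combine the parts."""
    rng = random.Random(seed)
    train, test = [], []
    for company, n in sizes.items():
        idx = list(range(n))
        rng.shuffle(idx)
        n_test = int(round(n * test_fraction))
        test += [(company, i) for i in idx[:n_test]]
        train += [(company, i) for i in idx[n_test:]]
    return train, test


train, test = per_company_split(sizes)
test_counts = {c: sum(1 for company, _ in test if company == c) for c in sizes}
print(test_counts)
```

Every company now contributes exactly the same fraction of its own data to the test set, which a single pooled shuffle only achieves on average.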

Upsample the less represented data

This has not often worked for me in practice, but you can also upsample the less represented data. For instance, give each data point a sampling probability inversely proportional to the number of data points in its subset. A data point from the company with 100 data points would then be 10 times as likely to be sampled as one from the company with 1000 data points.
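The inverse-size weighting described above can be sketched with the standard library alone. The company names, the sample size `k`, and the use of `random.Random.choices` for sampling with replacement are assumptions made for illustration.

```python
import random
from collections import Counter

# Hypothetical toy data: company id -> number of observations (sizes from the question).
sizes = {"company_1": 1000, "company_2": 2000, "company_3": 100}
pooled = [(company, i) for company, n in sizes.items() for i in range(n)]

# Each data point is weighted by 1 / (size of its company's subset).
weights = [1.0 / sizes[company] for company, _ in pooled]

# Draw training examples with replacement according to those weights.
rng = random.Random(0)
sample = rng.choices(pooled, weights=weights, k=30000)

# Count how often each company appears in the resampled data.
counts = Counter(company for company, _ in sample)
print(counts)
```

Because every company's weights sum to the same total, each company ends up with roughly the same number of draws, and points from the smallest company are repeated many times. If training happens in PyTorch, the same per-point weights could presumably be passed to `torch.utils.data.WeightedRandomSampler` instead of sampling by hand.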
