I’m training a DL model on some data. Due to size constraints, I don’t feed the model the raw data. I preprocess and filter, reducing files by a factor of between one hundred and one thousand. So 10 million data points become 10 thousand, or a 10×10,000 array, depending on the specifics.
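As a minimal sketch of that kind of reduction (assuming NumPy; the actual filtering is domain-specific, and block averaging here just stands in for it):

```python
import numpy as np

# Hypothetical sketch: shrink 10 million raw samples to 10 thousand
# summary values by averaging fixed-size blocks. This is one possible
# preprocessing step, not the author's actual pipeline.
raw = np.random.default_rng(0).normal(size=10_000_000)

factor = 1_000                       # reduction factor between 100 and 1,000
reduced = raw.reshape(-1, factor).mean(axis=1)

print(raw.size, "->", reduced.size)  # 10000000 -> 10000
```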

The upsides are:

1) Increased Accuracy

When I have a fair idea what I’m looking for.

2) Learning Something

A feature that works significantly better or worse than expected tells me something about the signal-to-noise ratio or the inherent physics.

For example, I spend a lot of time in the time domain. I sample at varying speeds, often in the megahertz regime. Deep learning models, and computers in general, only know what they are told. If I use two samples as features, V1 at time t1 and V2 at time t2, the model can’t explore relationships between them on its own. It can weight them, but it cannot calculate V1/V2 and weight that, nor ln(V1) and ln(V2). It could if I created a layer that does so, but for exploratory purposes it’s easier for me to calculate ln(V1) and feed it in as a feature than to build a model that explores ln(V1).
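The precomputed-feature approach above can be sketched like this (assuming NumPy; the names V1 and V2 and the synthetic data are illustrative only):

```python
import numpy as np

# Hypothetical sketch: precompute derived features so the model can
# weight them directly, instead of hoping it discovers ratios or logs.
rng = np.random.default_rng(1)
V1 = rng.uniform(0.1, 5.0, size=10_000)  # kept positive so ln and division are safe
V2 = rng.uniform(0.1, 5.0, size=10_000)

# A plain dense layer only forms linear combinations of its inputs,
# so V1/V2, ln(V1), and ln(V2) are supplied as explicit columns.
features = np.column_stack([V1, V2, V1 / V2, np.log(V1), np.log(V2)])

print(features.shape)  # (10000, 5)
```

Swapping columns in and out of a matrix like this is what makes the exploration cheap: no model surgery, just a new feature column.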

What that means is that as I feed features to this model, I’m exploring what works. When I find that ln(V1) is a great feature but ln(V2) isn’t, that tells me something useful. That way I gain some insight into the hidden physics, Plato’s puppet masters, and they’re as interesting as the tools I develop to expose them.

The downside is it’s hard to explore. It takes forever. This is why science is hard.

3) Reduced Computation Time

Depending on how I run the model, the cost of a calculation trends with the square of the data size. Cost here is a nebulous function of time and processing power. That factor of 1,000 from above squares to a 1/1,000,000 factor in computation time. That’s huge. That’s the high end, but even more modest reductions in total computation time, on the order of 1/100, are readily accessible.
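The arithmetic behind those numbers is just the quadratic scaling stated above: reduce the data by a factor r, and an O(n²) cost falls by r².

```python
# Quadratic-cost arithmetic: if cost scales as n**2, shrinking the
# data by a factor r shrinks the cost by a factor of r**2.
def cost_reduction(r):
    """Cost factor remaining for an O(n^2) computation after reducing n by r."""
    return 1 / r**2

print(cost_reduction(1_000))  # 1e-06 -> the 1/1,000,000 case
print(cost_reduction(10))     # 0.01  -> the more modest 1/100 case
```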

I haven’t really started exploring this yet. I probably should.