Multivariate and Deep Learning Interview Questions
Q: How do you model multivariate time series?
Short interview answer
If the variables influence each other, I need a model that captures cross-series dependence, not just separate univariate forecasts.
Classical choices
- VAR for stationary multivariate dependence
- VARMAX when exogenous variables are present
- VECM when the series are cointegrated
- State-space models for latent-dynamics formulations
VAR formula
y_t = c + A_1 y_(t-1) + A_2 y_(t-2) + ... + A_p y_(t-p) + ε_t
Where:
- y_t is now a vector
- each A_i is a coefficient matrix
- ε_t is a multivariate innovation term
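As a minimal sketch of the VAR mechanics, the snippet below simulates a bivariate VAR(1) and recovers the coefficient matrix A_1 by least squares. All parameter values (the entries of A_1, the noise scale, the sample size) are illustrative assumptions, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# True coefficient matrix; spectral radius < 1, so the process is stationary.
A1 = np.array([[0.5, 0.2],
               [0.1, 0.4]])
c = np.zeros(2)

T = 5000
y = np.zeros((T, 2))
for t in range(1, T):
    # y_t = c + A_1 y_(t-1) + ε_t with ε_t ~ N(0, 0.01 I)
    y[t] = c + A1 @ y[t - 1] + rng.normal(scale=0.1, size=2)

# OLS estimation: regress y_t on y_(t-1).
X, Y = y[:-1], y[1:]
A_hat = np.linalg.lstsq(X, Y, rcond=None)[0].T

print(np.round(A_hat, 2))  # close to A1
```

In practice one would use a library estimator (e.g. a VAR implementation with lag-order selection) rather than hand-rolled OLS, but the estimate above shows the cross-series coefficients being identified jointly.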
Q: What is cointegration, and when do you use VECM?
Short interview answer
Cointegration means several non-stationary series move together so that some linear combination of them is stationary. In that case, differencing everything with a plain VAR can lose long-run equilibrium structure, so VECM is more appropriate.
Core idea
If x_t and z_t are each I(1) but:
u_t = x_t - β z_t
is stationary, then they are cointegrated.
VECM form
Δ y_t = Π y_(t-1) + Σ_(i=1 to p-1) Γ_i Δ y_(t-i) + ε_t
The matrix Π captures long-run equilibrium adjustment.
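A quick numerical sketch of the core idea: below, z_t is a pure random walk (I(1)) and x_t = β z_t + u_t with stationary AR(1) noise u_t, so x_t is also I(1) but the combination x_t - β z_t is stationary. The value of β and the noise scales are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

T, beta = 10_000, 2.0
z = np.cumsum(rng.normal(size=T))         # I(1): a random walk, variance grows with t
u = np.zeros(T)
for t in range(1, T):
    u[t] = 0.5 * u[t - 1] + rng.normal()  # stationary AR(1) error
x = beta * z + u                          # x_t inherits the stochastic trend of z_t

# The cointegrating combination removes the common trend and recovers u_t.
spread = x - beta * z
print(round(float(np.std(spread)), 1), round(float(np.std(z)), 1))
```

The spread has bounded dispersion while z wanders; a formal analysis would test this with an Engle-Granger or Johansen procedure rather than eyeballing standard deviations.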
Q: Explain the vanishing gradient problem in RNNs.
Short interview answer
In backpropagation through time, gradients are repeatedly multiplied by Jacobian terms. If those terms have magnitude smaller than one, the gradient shrinks exponentially as we move backward in time.
Simple intuition
∂L/∂h_t = ∂L/∂h_T × Π_(k=t+1 to T) ∂h_k/∂h_(k-1)
If many factors in that product are small, early states receive almost no learning signal.
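A toy numerical illustration of that product, using an assumed scalar recurrence h_k = tanh(w · h_(k-1) + a_k) rather than a trained network: each backward step multiplies the gradient by ∂h_k/∂h_(k-1) = w · tanh'(a_k), and with these factors below one in magnitude the product collapses exponentially over the horizon.

```python
import numpy as np

rng = np.random.default_rng(2)

w = 0.9                                       # assumed recurrent weight, |w| < 1
T = 100                                       # backpropagation horizon
pre_acts = rng.normal(size=T)                 # assumed pre-activations a_k
factors = w * (1.0 - np.tanh(pre_acts) ** 2)  # ∂h_k/∂h_(k-1), each ≤ w in magnitude
grad = np.prod(factors)                       # stand-in for Π ∂h_k/∂h_(k-1)
print(grad)
```

Since tanh' never exceeds 1, the product is at most 0.9^100 ≈ 2.7e-5, so states 100 steps back receive essentially no gradient signal.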
Q: How does LSTM address this problem?
Short interview answer
LSTM introduces a cell state and gating so the network can preserve information through additive updates rather than only repeated multiplicative shrinkage.
Core equations
f_t = σ(W_f [h_(t-1), x_t] + b_f)
i_t = σ(W_i [h_(t-1), x_t] + b_i)
o_t = σ(W_o [h_(t-1), x_t] + b_o)
g_t = tanh(W_g [h_(t-1), x_t] + b_g)
c_t = f_t ⊙ c_(t-1) + i_t ⊙ g_t
h_t = o_t ⊙ tanh(c_t)
What to say in interviews
- The forget gate controls memory retention.
- The input gate controls how much new information is written.
- The output gate controls exposure of the hidden state.
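The six equations above can be sketched directly as a single numpy forward step. Shapes, weight scales, and the toy usage at the bottom are illustrative assumptions; a real implementation would fuse the four gate projections into one matrix multiply.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_o, W_g, b_f, b_i, b_o, b_g):
    """One LSTM step following the equations above."""
    z = np.concatenate([h_prev, x_t])   # [h_(t-1), x_t]
    f = sigmoid(W_f @ z + b_f)          # forget gate: memory retention
    i = sigmoid(W_i @ z + b_i)          # input gate: how much new info is written
    o = sigmoid(W_o @ z + b_o)          # output gate: exposure of hidden state
    g = np.tanh(W_g @ z + b_g)          # candidate cell update
    c = f * c_prev + i * g              # additive cell-state update
    h = o * np.tanh(c)
    return h, c

# Toy usage with random weights: hidden size 3, input size 2 (assumed shapes).
rng = np.random.default_rng(3)
H, D = 3, 2
Ws = [rng.normal(scale=0.1, size=(H, H + D)) for _ in range(4)]
bs = [np.zeros(H) for _ in range(4)]
h, c = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H), *Ws, *bs)
print(h.shape, c.shape)
```

Note the cell update c_t = f ⊙ c_(t-1) + i ⊙ g is additive, which is exactly the path that lets gradients flow backward without repeated multiplicative shrinkage.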
Q: What are time-series foundation models, and how are they different from training from scratch?
Strong answer
Time-series foundation models are pre-trained on many datasets or tasks and then adapted or used zero-shot on new series.
Good comparison
- Training from scratch learns only from one task or one dataset.
- Foundation models start with reusable representations and may transfer better across domains.
- They are most attractive when labeled data is limited or when you need strong cold-start baselines.
Examples to mention
- Chronos
- TimesFM
- MOMENT
- MOIRAI
- TimeGPT
Q: What is zero-shot forecasting, and when would you trust it?
Strong answer
Zero-shot forecasting means applying a pre-trained model to a new series without task-specific fine-tuning.
I trust it more when:
- the target domain resembles the pretraining distribution
- the forecast horizon is moderate
- backtests on a holdout slice are stable
- it beats simple baselines such as seasonal naive and AutoARIMA
I do not trust it blindly in highly regulated, sparse, or distribution-shifted settings without backtesting.
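The "beats simple baselines" check can be sketched as a holdout comparison against seasonal naive. The series and the stand-in "zero-shot" forecast below are synthetic assumptions purely to show the mechanics; in practice the forecast would come from a pre-trained model such as Chronos or TimesFM.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic monthly-style series: level + seasonality + noise (assumed parameters).
season = 12
t = np.arange(240)
y = 10 + 3 * np.sin(2 * np.pi * t / season) + rng.normal(scale=0.5, size=t.size)

train, test = y[:-season], y[-season:]

# Seasonal naive baseline: repeat the last observed full season.
seasonal_naive = train[-season:]

# Stand-in for a zero-shot model's forecast (assumption: truth plus small noise).
model_forecast = test + rng.normal(scale=0.3, size=season)

mae = lambda a, b: float(np.mean(np.abs(a - b)))
print("seasonal naive MAE:", mae(seasonal_naive, test))
print("zero-shot MAE:     ", mae(model_forecast, test))
```

Only when the candidate's backtest error is stably below the seasonal naive (and AutoARIMA) errors across several holdout slices would the zero-shot forecast earn trust.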