Offline reinforcement learning (RL) offers the promise of leveraging off-the-shelf datasets to learn policies, especially when exploration is costly or impractical. However, the generalization of learned policies is sometimes hampered by spurious correlations in the offline data. To address this, we integrate causal discovery into a low-rank Markov Decision Process (MDP) framework to reduce such spurious correlations. We propose Bilinear Causal wOrld Modeling (BiCOM), an algorithm that uses a low-rank MDP to capture causal transition dynamics, improving the generalization of offline RL algorithms. Recognizing the scarcity of benchmarks in offline causal RL, we design a decision-making benchmark with spurious correlations. Empirical evaluations across 18 tasks of varying data quality demonstrate the superior performance of BiCOM over existing offline RL algorithms in online deployments. Complementing the empirical findings, we also provide a theoretical analysis of BiCOM's finite-sample guarantees under structure awareness and pessimism, confirming the effectiveness and efficiency of incorporating low-rank MDPs in the offline setting.
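As a supplementary note, the low-rank MDP structure mentioned above can be illustrated with a minimal sketch: a transition kernel factored bilinearly as P(s' | s, a) ∝ φ(s, a)·μ(s'), where φ embeds state-action pairs and μ embeds next states in a shared low-dimensional space. All sizes and variable names below are illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 5 states, 3 actions, rank d = 2 (illustrative only).
n_states, n_actions, d = 5, 3, 2

# phi[s, a] in R^d: low-dimensional embedding of each state-action pair.
phi = rng.random((n_states, n_actions, d))
# mu[s'] in R^d: embedding of each candidate next state.
mu = rng.random((n_states, d))

# Bilinear (low-rank) transition model: score(s, a, s') = phi(s, a) . mu(s').
scores = phi @ mu.T  # shape (n_states, n_actions, n_states)

# Normalize over next states so each (s, a) row is a probability distribution.
P = scores / scores.sum(axis=-1, keepdims=True)

# Every (s, a) pair now induces a valid distribution over next states.
assert np.allclose(P.sum(axis=-1), 1.0)
```

The point of the factorization is that the full |S|·|A|·|S| transition tensor is represented with only (|S|·|A| + |S|)·d parameters, which is what makes finite-sample analysis with rank-d structure tractable.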