# transdim **Repository Path**: ccfgtt/transdim ## Basic Information - **Project Name**: transdim - **Description**: Machine learning for transportation data imputation and prediction. - **Primary Language**: Unknown - **License**: MIT - **Default Branch**: master - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2020-10-27 - **Last Updated**: 2020-12-19 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README # transdim [](https://opensource.org/licenses/MIT)  [](https://github.com/xinychen/transdim/archive/master.zip) [](https://github.com/xinychen/transdim)
Figure 1: Two classical missing patterns in a spatiotemporal setting.
We create three missing data mechanisms on real-world data. - **Missing data imputation** 🔥 - Random missing (RM): Each sensor lost observations at completely random. (★★★) - Non-random missing (NM): Each sensor lost observations during several days. (★★★★) - Blockout missing (BM): All sensors lost their observations at several consecutive time points. (★★★★)
Figure 2: Tensor completion framework for spatiotemporal missing traffic data imputation.
- **Spatiotemporal prediction** 🔥 - Forecasting without missing values. (★★★) - Forecasting with incomplete observations. (★★★★★)
Figure 3: Illustration of our proposed Low-Rank Autoregressive Tensor Completion (LATC) imputer/predictor with a prediction window τ (green nodes: observed values; white nodes: missing values; red nodes/panel: prediction; blue panel: training data to construct the tensor).
Implementation -------------- ### Open data In this repository, we have adapted the publicly available data sets into our experiments. If you want to view or use these data sets, please download them at the [../datasets/](https://github.com/xinychen/transdim/tree/master/datasets) folder in advance, and then run the following codes in your Python console: ```python import scipy.io tensor = scipy.io.loadmat('../datasets/Guangzhou-data-set/tensor.mat') tensor = tensor['tensor'] random_matrix = scipy.io.loadmat('../datasets/Guangzhou-data-set/random_matrix.mat') random_matrix = random_matrix['random_matrix'] random_tensor = scipy.io.loadmat('../datasets/Guangzhou-data-set/random_tensor.mat') random_tensor = random_tensor['random_tensor'] ``` If you want to view the original data, please check out the following links: - **Gdata**: [Guangzhou urban traffic speed data set](https://doi.org/10.5281/zenodo.1205228). - **Bdata**: [Birmingham parking data set](https://archive.ics.uci.edu/ml/datasets/Parking+Birmingham). - **Hdata**: [Hangzhou metro passenger flow data set](https://doi.org/10.5281/zenodo.3145403). - **Sdata**: [Seattle freeway traffic speed data set](https://github.com/zhiyongc/Seattle-Loop-Data). - **Ndata**: [NYC taxi data set](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page). In particular, we take into account large-scale traffic data imputation/prediction on PeMS-4W and PeMS-8W data sets: - **PeMS-4W/8W/12W**: [Large-scale traffic speed data sets in California, USA](https://doi.org/10.5281/zenodo.3939792). You can download the data sets from [Zenodo](https://doi.org/10.5281/zenodo.3939792) and place them at the folder of datasets (data path example: `../datasets/California-data-set/pems-4w.csv`). Then you can open data in Python by using `Pandas`: ```python import pandas as pd data = pd.read_csv('../datasets/California-data-set/pems-4w.csv', header = None) ``` For model evaluation, we mask certain entries of the "observed" data as missing values and then perform imputation for these "missing" values. ### Model implementation In our experiments, we have implemented some machine learning models mainly on `Numpy`, and written these Python codes with **Jupyter Notebook**. So, if you want to evaluate these models, please download and run these notebooks directly (prerequisite: **download the data sets** in advance). - **Our models** | Task | Jupyter Notebook | Gdata | Bdata | Hdata | Sdata | Ndata | | :---------------------: | :----------------------------------------------------------- | :---: | :---: | :---: | :---: | :---: | | Missing Data Imputation | [**BTMF**](https://nbviewer.jupyter.org/github/xinychen/transdim/blob/master/experiments/Imputation-BTMF.ipynb) | ✅ | ✅ | ✅ | ✅ | 🔶 | | | [**BGCP**](https://nbviewer.jupyter.org/github/xinychen/transdim/blob/master/experiments/Imputation-BGCP.ipynb) | ✅ | ✅ | ✅ | ✅ | ✅ | | | [**LRTC-TNN**](https://nbviewer.jupyter.org/github/xinychen/transdim/blob/master/experiments/Imputation-LRTC-TNN.ipynb) | ✅ | ✅ | ✅ | ✅ | 🔶 | | | [**BTTF**](https://nbviewer.jupyter.org/github/xinychen/transdim/blob/master/experiments/Imputation-BTTF.ipynb) | 🔶 | 🔶 | 🔶 | 🔶 | ✅ | | Single-Step Prediction | [**BTMF**](https://nbviewer.jupyter.org/github/xinychen/transdim/blob/master/experiments/Prediction-ST-Online-BTMF.ipynb) | ✅ | ✅ | ✅ | ✅ | 🔶 | | | [**BTTF**](https://nbviewer.jupyter.org/github/xinychen/transdim/blob/master/experiments/Prediction-ST-Online-BTTF.ipynb) | 🔶 | 🔶 | 🔶 | 🔶 | ✅ | | Multi-Step Prediction | [**BTMF**](https://nbviewer.jupyter.org/github/xinychen/transdim/blob/master/experiments/Prediction-Multi-BTMF.ipynb) | ✅ | ✅ | ✅ | ✅ | 🔶 | | | [**BTTF**](https://nbviewer.jupyter.org/github/xinychen/transdim/blob/master/experiments/Prediction-Multi-BTTF.ipynb) | 🔶 | 🔶 | 🔶 | 🔶 | ✅ | - **Baselines** | Task | Jupyter Notebook | Gdata | Bdata | Hdata | Sdata | Ndata | | :---------------------: | :----------------------------------------------------------- | :---: | :---: | :---: | :---: | :---: | | Missing Data Imputation | [**BayesTRMF**](https://nbviewer.jupyter.org/github/xinychen/transdim/blob/master/experiments/Imputation-BayesTRMF.ipynb) | ✅ | ✅ | ✅ | ✅ | 🔶 | | | [**TRMF**](https://nbviewer.jupyter.org/github/xinychen/transdim/blob/master/experiments/Imputation-TRMF.ipynb) | ✅ | ✅ | ✅ | ✅ | 🔶 | | | [**BPMF**](https://nbviewer.jupyter.org/github/xinychen/transdim/blob/master/experiments/Imputation-BPMF.ipynb) | ✅ | ✅ | ✅ | ✅ | 🔶 | | | [**HaLRTC**](https://nbviewer.jupyter.org/github/xinychen/transdim/blob/master/experiments/Imputation-HaLRTC.ipynb) | ✅ | ✅ | ✅ | ✅ | 🔶 | | | [**TF-ALS**](https://nbviewer.jupyter.org/github/xinychen/transdim/blob/master/experiments/Imputation-TF-ALS.ipynb) | ✅ | ✅ | ✅ | ✅ | ✅ | | | [**BayesTRTF**](https://nbviewer.jupyter.org/github/xinychen/transdim/blob/master/experiments/Imputation-BayesTRTF.ipynb) | 🔶 | 🔶 | 🔶 | 🔶 | ✅ | | | [**BPTF**](https://nbviewer.jupyter.org/github/xinychen/transdim/blob/master/experiments/Imputation-BPTF.ipynb) | 🔶 | 🔶 | 🔶 | 🔶 | ✅ | | Single-Step Prediction | [**BayesTRMF**](https://nbviewer.jupyter.org/github/xinychen/transdim/blob/master/experiments/Prediction-ST-Online-BayesTRMF.ipynb) | ✅ | ✅ | ✅ | ✅ | 🔶 | | | [**TRMF**](https://nbviewer.jupyter.org/github/xinychen/transdim/blob/master/experiments/Prediction-ST-Online-TRMF.ipynb) | ✅ | ✅ | ✅ | ✅ | 🔶 | | | [**BayesTRTF**](https://nbviewer.jupyter.org/github/xinychen/transdim/blob/master/experiments/Prediction-ST-Online-BayesTRTF.ipynb) | 🔶 | 🔶 | 🔶 | 🔶 | ✅ | | | [**TRTF**](https://nbviewer.jupyter.org/github/xinychen/transdim/blob/master/experiments/Prediction-ST-Online-TRTF.ipynb) | 🔶 | 🔶 | 🔶 | 🔶 | ✅ | | Multi-Step Prediction | [**BayesTRMF**](https://nbviewer.jupyter.org/github/xinychen/transdim/blob/master/experiments/Prediction-Multi-BayesTRMF.ipynb) | ✅ | ✅ | ✅ | ✅ | 🔶 | | | [**TRMF**](https://nbviewer.jupyter.org/github/xinychen/transdim/blob/master/experiments/Prediction-Multi-TRMF.ipynb) | ✅ | ✅ | ✅ | ✅ | 🔶 | | | [**BayesTRTF**](https://nbviewer.jupyter.org/github/xinychen/transdim/blob/master/experiments/Prediction-Multi-BayesTRTF.ipynb) | 🔶 | 🔶 | 🔶 | 🔶 | ✅ | | | [**TRTF**](https://nbviewer.jupyter.org/github/xinychen/transdim/blob/master/experiments/Prediction-Multi-TRTF.ipynb) | 🔶 | 🔶 | 🔶 | 🔶 | ✅ | * ✅ — Cover * 🔶 — Does not cover * 🚧 — Under development ### Imputation/Prediction performance - **Imputation example (on Guangzhou data)**  *(a) Time series of actual and estimated speed within two weeks from August 1 to 14.*  *(b) Time series of actual and estimated speed within two weeks from September 12 to 25.* > The imputation performance of BGCP (CP rank r=15 and missing rate α=30%) under the fiber missing scenario with third-order tensor representation, where the estimated result of road segment #1 is selected as an example. In the both two panels, red rectangles represent fiber missing (i.e., speed observations are lost in a whole day). - **Prediction example**    Quick Start -------------- This is an imputation example of Low-Rank Tensor Completion with Truncated Nuclear Norm minimization (LRTC-TNN). One notable thing is that unlike the complex equations in our paper, our Python implementation is extremely easy to work with. - First, import some necessary packages: ```python import numpy as np from numpy.linalg import inv as inv ``` - Define the operators of tensor unfolding (`ten2mat`) and matrix folding (`mat2ten`) using `Numpy`: ```python def ten2mat(tensor, mode): return np.reshape(np.moveaxis(tensor, mode, 0), (tensor.shape[mode], -1), order = 'F') ``` ```python def mat2ten(mat, tensor_size, mode): index = list() index.append(mode) for i in range(tensor_size.shape[0]): if i != mode: index.append(i) return np.moveaxis(np.reshape(mat, list(tensor_size[index]), order = 'F'), 0, mode) ``` - Define Singular Value Thresholding (SVT) for Truncated Nuclear Norm (TNN) minimization: ```python def svt_tnn(mat, alpha, rho, theta): tau = alpha / rho [m, n] = mat.shape if 2 * m < n: u, s, v = np.linalg.svd(mat @ mat.T, full_matrices = 0) s = np.sqrt(s) idx = np.sum(s > tau) mid = np.zeros(idx) mid[:theta] = 1 mid[theta:idx] = (s[theta:idx] - tau) / s[theta:idx] return (u[:,:idx] @ np.diag(mid)) @ (u[:,:idx].T @ mat) elif m > 2 * n: return svt_tnn(mat.T, tau, theta).T u, s, v = np.linalg.svd(mat, full_matrices = 0) idx = np.sum(s > tau) vec = s[:idx].copy() vec[theta:] = s[theta:] - tau return u[:,:idx] @ np.diag(vec) @ v[:idx,:] ``` - Define performance metrics (i.e., RMSE, MAPE): ```python def compute_rmse(var, var_hat): return np.sqrt(np.sum((var - var_hat) ** 2) / var.shape[0]) def compute_mape(var, var_hat): return np.sum(np.abs(var - var_hat) / var) / var.shape[0] ``` - Define LRTC-TNN: ```python def LRTC(dense_tensor, sparse_tensor, alpha, rho, theta, epsilon, maxiter): """Low-Rank Tenor Completion with Truncated Nuclear Norm, LRTC-TNN.""" dim = np.array(sparse_tensor.shape) pos_missing = np.where(sparse_tensor == 0) pos_test = np.where((dense_tensor != 0) & (sparse_tensor == 0)) X = np.zeros(np.insert(dim, 0, len(dim))) # \boldsymbol{\mathcal{X}} T = np.zeros(np.insert(dim, 0, len(dim))) # \boldsymbol{\mathcal{T}} Z = sparse_tensor.copy() last_tensor = sparse_tensor.copy() snorm = np.sqrt(np.sum(sparse_tensor ** 2)) it = 0 while True: rho = min(rho * 1.05, 1e5) for k in range(len(dim)): X[k] = mat2ten(svt_tnn(ten2mat(Z - T[k] / rho, k), alpha[k], rho, np.int(np.ceil(theta * dim[k]))), dim, k) Z[pos_missing] = np.mean(X + T / rho, axis = 0)[pos_missing] T = T + rho * (X - np.broadcast_to(Z, np.insert(dim, 0, len(dim)))) tensor_hat = np.einsum('k, kmnt -> mnt', alpha, X) tol = np.sqrt(np.sum((tensor_hat - last_tensor) ** 2)) / snorm last_tensor = tensor_hat.copy() it += 1 if (it + 1) % 50 == 0: print('Iter: {}'.format(it + 1)) print('RMSE: {:.6}'.format(compute_rmse(dense_tensor[pos_test], tensor_hat[pos_test]))) print() if (tol < epsilon) or (it >= maxiter): break print('Imputation MAPE: {:.6}'.format(compute_mape(dense_tensor[pos_test], tensor_hat[pos_test]))) print('Imputation RMSE: {:.6}'.format(compute_rmse(dense_tensor[pos_test], tensor_hat[pos_test]))) print() return tensor_hat ``` - Let us try it on Guangzhou urban traffic speed data set (Gdata): ```python import scipy.io tensor = scipy.io.loadmat('../datasets/Guangzhou-data-set/tensor.mat') dense_tensor = tensor['tensor'] random_tensor = scipy.io.loadmat('../datasets/Guangzhou-data-set/random_tensor.mat') random_tensor = random_tensor['random_tensor'] missing_rate = 0.2 ### Random missing (RM) scenario: binary_tensor = np.round(random_tensor + 0.5 - missing_rate) sparse_tensor = np.multiply(dense_tensor, binary_tensor) ``` - Run the imputation experiment: ```python import time start = time.time() alpha = np.ones(3) / 3 rho = 1e-5 theta = 0.30 epsilon = 1e-4 maxiter = 200 LRTC(dense_tensor, sparse_tensor, alpha, rho, theta, epsilon, maxiter) end = time.time() print('Running time: %d seconds'%(end - start)) ``` > This example is from [../experiments/Imputation-LRTC-TNN.ipynb](https://nbviewer.jupyter.org/github/xinychen/transdim/blob/master/experiments/Imputation-LRTC-TNN.ipynb), you can check out this Jupyter Notebook for advanced usage. Toy Examples -------------- - Time series forecasting - [Structured low-rank matrix completion](https://nbviewer.jupyter.org/github/xinychen/transdim/blob/master/toy-examples/SLRMC.ipynb) - Time series imputation Our Publications -------------- - Xinyu Chen, Yixian Chen, Lijun Sun (2020). **Scalable low-rank autoregressive tensor learning for spatiotemporal traffic data imputation**. arXiv: 2008.03194. [[preprint](https://arxiv.org/abs/2008.03194)] [[data](https://doi.org/10.5281/zenodo.3939792)] [[Python code](https://github.com/xinychen/tensor-learning)] - Xinyu Chen, Lijun Sun (2020). **Low-rank autoregressive tensor completion for multivariate time series forecasting**. arXiv: 2006.10436. [[preprint](https://arxiv.org/abs/2006.10436)] [[data & Python code](https://github.com/xinychen/tensor-learning)] - Xinyu Chen, Jinming Yang, Lijun Sun (2020). **A nonconvex low-rank tensor completion model for spatiotemporal traffic data imputation**. Transportation Research Part C: Emerging Technologies, 117: 102673. [[preprint](https://arxiv.org/abs/2003.10271)] [[doi](https://doi.org/10.1016/j.trc.2020.102673)] [[data & Python code](https://github.com/xinychen/transdim)] - Xinyu Chen, Lijun Sun (2019). **Bayesian temporal factorization for multidimensional time series prediction**. arXiv: 1910.06366. [[preprint](https://arxiv.org/abs/1910.06366)] [[slide](https://xinychen.github.io/paper/Bayesian-temporal-factorization-slide.pdf)] [[data & Python code](https://github.com/xinychen/transdim)] - Xinyu Chen, Zhaocheng He, Yixian Chen, Yuhuan Lu, Jiawei Wang (2019). **Missing traffic data imputation and pattern discovery with a Bayesian augmented tensor factorization model**. Transportation Research Part C: Emerging Technologies, 104: 66-77. [[preprint](https://xinychen.github.io/paper/BATF.pdf)] [[doi](https://doi.org/10.1016/j.trc.2019.03.003)] [[slide](https://doi.org/10.5281/zenodo.2632552)] [[data](http://doi.org/10.5281/zenodo.1205229)] [[Matlab code](https://github.com/sysuits/BATF)] - Xinyu Chen, Zhaocheng He, Lijun Sun (2019). **A Bayesian tensor decomposition approach for spatiotemporal traffic data imputation**. Transportation Research Part C: Emerging Technologies, 98: 73-84. [[preprint](https://www.researchgate.net/publication/329177786_A_Bayesian_tensor_decomposition_approach_for_spatiotemporal_traffic_data_imputation)] [[doi](https://doi.org/10.1016/j.trc.2018.11.003)] [[data](http://doi.org/10.5281/zenodo.1205229)] [[Matlab code](https://github.com/lijunsun/bgcp_imputation)] [[Python code](https://github.com/xinychen/transdim/blob/master/experiments/Imputation-BGCP.ipynb)] - Xinyu Chen, Zhaocheng He, Jiawei Wang (2018). **Spatial-temporal traffic speed patterns discovery and incomplete data recovery via SVD-combined tensor decomposition**. Transportation Research Part C: Emerging Technologies, 86: 59-77. [[doi](http://doi.org/10.1016/j.trc.2017.10.023)] [[data](http://doi.org/10.5281/zenodo.1205229)] >This project is from the above papers, please cite these papers if they help your research. Collaborators --------------![]() Xinyu Chen 💻 |
![]() Jinming Yang 💻 |
![]() Yixian Chen 💻 |
![]() Lijun Sun 💻 |
![]() Tianyang Han 💻 |
![]() Lijun Sun 💻 |