JuliaCon 2018
Demian Panigo, Pablo Gluzmann, Esteban Mocskos, Adán Mauri Ungaro, and Valentin Mari

Universidad de Buenos Aires / CONICET

GSReg.jl: High Performance Computing in Econometrics. Let’s do it faster and simpler with Julia

Econometrics allows researchers to weigh competing theories about complex processes, theories that can lead to very different diagnoses and even opposite policy advice. Advances in model selection techniques, driven by the increasing number of available HPC algorithms, constitute one of its major contributions to economic theory. In-sample model selection has benefited from mathematical results that reduce the search space (e.g. branch-and-bound theorems) and efficiently find the best-subset regression in terms of a chosen information criterion. Heuristic approaches such as genetic algorithms, and dimension-reduction methods such as stepwise, Lasso, or ridge estimators, are also available; while they do not guarantee global optimality, they are fast and well suited to big and/or sparse data. Out-of-sample model selection, by contrast, requires more computing resources. In a machine-learning setting (e.g. problems focused on predictive analysis) there is a growing universe of training/test algorithms (many of them performing very well in Julia) for comparing alternative results and finding a suitable model. In econometrics (e.g. problems focused on causal inference), five requirements narrow the set of available algorithms:

1) Parsimony, to avoid very large atheoretical models;
2) Interpretability for causal inference, rejecting “intuition-loss” transformations and/or complex combinations;
3) Across-model sensitivity analysis, since economic theory is preferred over a single “best-model” result;
4) Robustness to time-series and panel-data structure, which rules out raw bootstrapping or random subsample selection for training and test sets;
5) Advanced residual properties, e.g. going beyond the i.i.d. assumption and testing additional panel-structure properties for each model evaluated, which forces a departure from many algorithms.

For these reasons, most economists prefer flexible all-subset-regression approaches, choosing among alternative models by means of out-of-sample criteria, model-averaging results, theoretical limits on covariate coefficients, and residual constraints. While still infeasible for sparse data (p >> n), hardware and software innovations now allow researchers to choose among one billion models in a few hours on a standard personal computer. Consequently, almost every statistical package includes an all-subset-regression function, and some have even developed a parallel version of their core algorithm (pdredge in R, gsregp in Stata).

This talk introduces GSReg.jl (https://github.com/ParallelGSReg/GSReg.jl), a new package that performs all-subset regression exploiting Julia’s parallel capabilities, letting users choose between a simple GUI (https://github.com/ParallelGSReg/GSRegGUI) and an R/Stata-friendly command-line interface. We will discuss its main features, pros and cons, limitations, future extensions, and differences from similar existing Julia packages such as BestSubsetRegression.jl, LARS.jl, Lasso.jl, Mads.jl, ParallelSparseRegression.jl, and SubsetSelection.jl. We will show that GSReg.jl is 4 to 10 times faster than similar alternatives and more than 100 times faster than the original (sequential) gsreg Stata version, on a latest-generation personal computer. A forthcoming paper will include programming details, profiling data, extended examples, and benchmarking results.
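To make the search concrete, here is a minimal, self-contained Julia sketch of the brute-force all-subset scan that packages in this space parallelize. It is an illustration only: the function names (bic_for_subset, all_subset_search), the choice of BIC as the selection criterion, and the bitmask enumeration are assumptions for exposition, not GSReg.jl’s actual API or algorithm, which also supports out-of-sample criteria and residual tests.

using Distributed
# addprocs(4)  # optionally add local worker processes before the @everywhere block

@everywhere begin
    # OLS fit of y on the columns of X selected by the bitmask `mask`,
    # returning the Bayesian Information Criterion (lower is better).
    function bic_for_subset(y::Vector{Float64}, X::Matrix{Float64}, mask::Int)
        cols = [j for j in 1:size(X, 2) if (mask >> (j - 1)) & 1 == 1]
        Xs = hcat(ones(length(y)), X[:, cols])   # selected regressors plus an intercept
        beta = Xs \ y                            # least-squares solve
        resid = y - Xs * beta
        n, k = length(y), size(Xs, 2)
        sigma2 = sum(abs2, resid) / n
        return n * log(sigma2) + k * log(n)      # BIC up to an additive constant
    end
end

# Exhaustively score all 2^p - 1 non-empty regressor subsets in parallel
# and return the best (BIC, bitmask) pair. Each bit of the mask flags
# whether the corresponding column of X enters the model.
function all_subset_search(y::Vector{Float64}, X::Matrix{Float64})
    p = size(X, 2)
    scores = pmap(mask -> (bic_for_subset(y, X, mask), mask), 1:(2^p - 1))
    return minimum(scores)                       # tuples compare lexicographically
end

# Hypothetical usage on simulated data:
n, p = 500, 10
X = randn(n, p)
y = X[:, 1] .- 2 .* X[:, 3] .+ randn(n)
best_bic, best_mask = all_subset_search(y, X)

With p regressors the scan visits 2^p - 1 candidate models, which is why p of roughly 30 already reaches the billion-model scale mentioned above; Distributed.pmap simply spreads those independent evaluations across worker processes.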

Speaker's bio

HPC, parallel programming, distributed systems, and applying parallel resources to real-life applications.