Videos from JuliaCon are now available online
2018
Previous editions: 2017 | 2016 | 2015 | 2014
Lyndon White



DataDeps.jl and other foundational tools for data driven research

This talk will cover the fundamental process of getting from a dataset on a web-server, into data in your program. Almost all empirical research work is data driven. This is particularly true of any field that is using machine learning. As such, setting up your data environment in a repeatable and clean way is essential for producing replicable research. Similarly, many packages have some requirement on data included to function, for example WordNet.jl requires the WordNet database. Deploying a package based on using an already trained machine learning model requires downloading that model. This talk will primarily focus on DataDeps.jl which allows for the automatic installation and management of data dependencies. For researchers and package developers DataDeps.jl solves 3 important issues:

To achieve this DataDeps.jl needs each data dependency to be declared. This declaration requires information such as the name of the dataset, it’s URLs, a checksum, and who to give credit to for its original creation etc. I found myself copy-pasting that data from the websites. DataDepsGenerators.jl is a package that can generate this code given a link to a supported webpage describing. This makes it really easy to just grab someone else’s published data, and depend upon it. Then DataDeps.jl will resolve that dependency to get the data onto your machine. Once you’ve got the data onto your machine, the final stage is to load it up into a structure Julia can work with. For tabular data, julia has you well covered with a number of packages like JuliaDB, DataFrames.jl and many other supporting packages. MLDatasets.jl, uses DataDeps.jl as a backend, provides specialised methods for accessing various commonly used machine learning datasets. CorpusLoaders.jl provides a similar service for natural language corpora. Corpora often have factors that differ from other types of data.

Packages discussed Packages discussed in great detail:

Speaker's bio