DataDeps.jl and other foundational tools for data driven research
This talk will cover the fundamental process of getting from a dataset on a web-server to data in your program. Almost all empirical research work is data driven. This is particularly true of any field that is using machine learning. As such, setting up your data environment in a repeatable and clean way is essential for producing replicable research. Similarly, many packages require some included data to function; for example, WordNet.jl requires the WordNet database. Deploying a package that uses an already-trained machine learning model requires downloading that model.
This talk will primarily focus on DataDeps.jl, which allows for the automatic installation and management of data dependencies. For researchers and package developers, DataDeps.jl solves 3 important issues:
- Storage location: Where do I put it?
  - Should it be on the local disk (small) or the network file-store (slow)?
  - If I move it, I’m going to have to reconfigure things.
- Redistribution: I didn’t create this data.
  - Am I allowed to redistribute it?
  - How will I give credit, and ensure the users know who the original creator was?
- Replication: How can I be sure that someone running my code has the same data?
  - What if they download the wrong data, or extract it incorrectly?
  - What if it gets corrupted or modified?

On top of this: by allowing fully automated data dependency setup, end-to-end automated testing becomes possible.
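Guarding against corrupted or modified data comes down to checksums: the declared checksum is compared against one recomputed from the downloaded file. A minimal sketch of that check using Julia's standard SHA library (the file contents here are illustrative, not a real dataset):

```julia
using SHA  # Julia standard library; provides sha256 and friends

# Illustrative contents; in practice this would be the downloaded file's bytes.
data = Vector{UInt8}("example dataset contents")

# Hex digest, as would be declared alongside the data dependency.
checksum = bytes2hex(sha256(data))

# Validation: recompute and compare against the declared value.
@assert checksum == bytes2hex(sha256(data))
println(length(checksum))  # a SHA-256 hex digest is 64 characters
```

MD5.jl plays the same role for datasets whose published checksums are MD5 digests.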
To achieve this, DataDeps.jl needs each data dependency to be declared. This declaration requires information such as the name of the dataset, its URLs, a checksum, and who to give credit to for its original creation. I found myself copy-pasting that information from the websites. DataDepsGenerators.jl is a package that can generate this code given a link to a supported webpage describing the dataset. This makes it really easy to grab someone else’s published data and depend upon it. Then DataDeps.jl will resolve that dependency to get the data onto your machine.
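Such a declaration can be sketched as follows; the dataset name, URL, checksum, and metadata below are placeholders, not a real dataset:

```julia
using DataDeps

# Hypothetical declaration: the structure is what DataDeps.jl expects,
# but "ExampleCorpus" and its URL/checksum are placeholders.
register(DataDep(
    "ExampleCorpus",   # name used to refer to the dataset
    """
    Example Corpus
    Website: https://example.com
    Author: A. Author
    License: CC-BY 4.0
    Please cite the original creators when using this data.
    """,               # message shown to the user before downloading
    "https://example.com/corpus.zip",  # remote URL(s)
    "0000000000000000000000000000000000000000000000000000000000000000";  # checksum
    post_fetch_method = unpack  # e.g. extract the archive after download
))

# Later, anywhere in the code, resolve the dependency: this downloads the
# data on first use, and simply returns the local path thereafter.
# path = datadep"ExampleCorpus"
```

The `datadep"..."` string macro is how downstream code refers to the data, so the storage location can change without reconfiguring anything.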
Once you’ve got the data onto your machine, the final stage is to load it into a structure Julia can work with. For tabular data, Julia has you well covered with packages like JuliaDB, DataFrames.jl, and many other supporting packages. MLDatasets.jl, which uses DataDeps.jl as a backend, provides specialised methods for accessing various commonly used machine learning datasets. CorpusLoaders.jl provides a similar service for natural language corpora. Corpora often have factors that differ from other types of data:
- They often require tokenisation to become usable, for which we use WordTokenizers.jl.
- Tokenisation increases the memory used; to decrease this we use InternedStrings.jl, and load the corpora lazily via iterators.
- To handle the hierarchical structure (Document, Paragraph, Sentence, Word) of these iterators we introduce MultiResolutionIterators.jl.

Julia is excellent for data-driven science, and this talk will help you understand how you can handle your data in a more robust way.
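The tokenise-and-intern step above can be sketched with the real package APIs; the sample text and pipeline shape are illustrative:

```julia
# Sketch: split a corpus into sentences and tokens, interning each token
# so repeated tokens share one copy in memory.
using WordTokenizers   # provides split_sentences and tokenize
using InternedStrings  # provides intern

text = "Julia is fast. Julia is fun."

sentences = split_sentences(text)
tokens = [[intern(String(w)) for w in tokenize(s)] for s in sentences]

# Repeated tokens are now identical objects, not merely equal strings,
# which also makes equality checks a cheap pointer comparison:
@assert tokens[1][1] === tokens[2][1]  # both are the one interned "Julia"
```

In CorpusLoaders.jl this nesting (corpus, document, sentence, word) is exposed through iterators, which is where MultiResolutionIterators.jl comes in.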
Packages discussed in great detail:
- DataDeps.jl: manages data dependencies.
- CorpusLoaders.jl: a data package building on roughly every other package mentioned here.

Packages discussed in significant detail:
- DataDepsGenerators.jl: converts URLs pointing to webpages containing metadata into code for DataDeps.jl.
- MultiResolutionIterators.jl: the core of having a good API for CorpusLoaders.jl.
- InternedStrings.jl: for decreasing memory usage and speeding up equality checks.
- WordTokenizers.jl: a natural language tokenization and string splitting package.

Packages mentioned:
- MLDatasets.jl: a package full of datasets, similar overall to CorpusLoaders.jl but with some significant differences in philosophy and default assumptions.
- MLDataUtils.jl: for most non-domain-specific data wrangling before you feed your data to a machine learning system.
- WordNet.jl: the Julia interface to the WordNet lexical resource.
- DataFrames.jl: for working with tabular data.
- JuliaDB: for working with n-dimensional tabular data.
- MD5.jl and SHA.jl: for checksums for DataDeps.jl.
Speaker's bio