New in DataStreams.jl: Type flexibility, querying, and parallelism, oh my!
The DataStreams.jl framework is behind a number of key packages in the Julia data ecosystem. At it’s core, it defines the “Source” and “Sink” interfaces that various formats can implement to automatically integrate with other formats that also implement the interfaces. This solves the one-to-many interop problem that always plagues data formats (“what? it only takes CSV files??”). With DataStreams, it’s quick and easy to implement the interface and automatically hook into the rest of the Julia data ecosystem.
So what’s new and noteworthy in DataStreams?
- Flexibly typed schemas: a long-standing issue with any sort of data transfer is how to align expected types between source and sink; Base julia itself has pioneered a flexible, yet performant solution in it’s implementation of map over collections. This same approach has been applied to DataStreams to allow dynamic, type-inference-independent transfer from sources to sinks.
- IO-integrated querying functionality: how many times have you thought, “sheesh, I wish there was a way to only parse a few columns, filter out certain values, and apply a transformation to this csv file all at the same time!” Ok, maybe not those words exactly, but with DataStreams, now you can! A fully integrated query-planner can now take any number of transformations and apply them at the IO-level to avoid more data transfer than absolutely necessary and fuse them all together in tightly compiled Julia code.
- Data transfer parallelism: Sinks can now signal to Sources that they support parallel streaming; this can lead to massive improvements in data throughput and fully leveraging a system’s resources
Speaker's bio
Attended Carnegie Mellon for a master’s degree in data science and active Julia contributor for 4 years now.