Basics
JuliaDB offers two main data structures as well as distributed counterparts. This allows you to easily scale up an analysis, as operations that work on non-distributed tables either work out of the box or are easy to transition for distributed tables.
Here is a high level overview of tables in JuliaDB:
- Tables store data in columns.
- Tables are typed.
- Changing a table in some way therefore requires returning a new table (underlying data is not copied).
- JuliaDB has few mutating operations because a new table is necessary in most cases.
Data for examples:
x = 1:10
y = vcat(fill('a', 4), fill('b', 6))
z = randn(10);10-element Vector{Float64}:
-0.0468656624370505
-0.5433314559561914
0.11957453709749813
0.8859928959094242
-1.917342436712874
0.5597762348401842
0.30746180863333816
-1.2553543812297594
-1.519593213837593
0.5610472334945272IndexedTable
An IndexedTable is wrapper around a (named) tuple of Vectors, but it behaves like a Vector of (named) tuples. You can choose to sort the table by any number of primary keys (in this case columns :x and :y).
An IndexedTable is created with data in Julia via the table function or with data on disk via the loadtable function.
julia> t = table((x=x, y=y, z=z); pkey = [:x, :y])Table with 10 rows, 3 columns: x y z ─────────────────── 1 'a' -0.0468657 2 'a' -0.543331 3 'a' 0.119575 4 'a' 0.885993 5 'b' -1.91734 6 'b' 0.559776 7 'b' 0.307462 8 'b' -1.25535 9 'b' -1.51959 10 'b' 0.561047julia> t[1](x = 1, y = 'a', z = -0.0468656624370505)julia> t[end](x = 10, y = 'b', z = 0.5610472334945272)
NDSparse
An NDSparse has a similar underlying structure to IndexedTable, but it behaves like a sparse array with arbitrary indices. The keys of an NDSparse are sorted, much like the primary keys of an IndexedTable.
An NDSparse is created with data in Julia via the ndsparse function or with data on disk via the loadndsparse function.
julia> nd = ndsparse((x=x, y=y), (z=z,))2-d NDSparse with 10 values (1 field named tuples): x y │ z ────────┼─────────── 1 'a' │ -0.0468657 2 'a' │ -0.543331 3 'a' │ 0.119575 4 'a' │ 0.885993 5 'b' │ -1.91734 6 'b' │ 0.559776 7 'b' │ 0.307462 8 'b' │ -1.25535 9 'b' │ -1.51959 10 'b' │ 0.561047julia> nd[1, 'a'](z = -0.0468656624370505,)julia> nd[10, 'j'].zERROR: KeyError: key (10, 'j') not foundjulia> nd[1, :]1-d NDSparse with 1 values (1 field named tuples): y │ z ────┼─────────── 'a' │ -0.0468657
Selectors
JuliaDB has a variety of ways to select columns. These selection methods get used across many JuliaDB's functions: select, reduce, groupreduce, groupby, join, transform, reindex, and more.
To demonstrate selection, we'll use the select function. A selection can be any of the following types:
Integer– returns the column at this position.Symbol– returns the column with this name.Pair{Selection => Function}– selects and maps a function over the selection, returns the result.AbstractArray– returns the array itself. This must be the same length as the table.TupleofSelection– returns a table containing a column for every selector in the tuple.Regex– returns the columns with names that match the regular expression.Type– returns columns with elements of the given type.Not(Selection)– returns columns that are not included in the selection.Between(first, last)– returns columns betweenfirstandlast.Keys()– return the primary key columns.
t = table(1:10, randn(10), rand(Bool, 10); names = [:x, :y, :z])Table with 10 rows, 3 columns:
x y z
────────────────────
1 1.07085 false
2 -0.960728 false
3 -0.81665 false
4 -0.111198 false
5 2.24545 true
6 -0.811255 false
7 1.15603 true
8 0.759486 false
9 1.64535 false
10 -0.601752 falseselect the :x vector
julia> select(t, 1)10-element Vector{Int64}: 1 2 3 4 5 6 7 8 9 10julia> select(t, :x)10-element Vector{Int64}: 1 2 3 4 5 6 7 8 9 10
map a function to the :y vector
julia> select(t, 2 => abs)10-element Vector{Float64}: 1.0708479992338487 0.9607276276622873 0.8166498931995329 0.11119787501241411 2.24545239704408 0.8112545684559442 1.1560277290477903 0.7594856747567902 1.645345579829251 0.6017516091385081julia> select(t, :y => x -> x > 0 ? x : -x)10-element Vector{Float64}: 1.0708479992338487 0.9607276276622873 0.8166498931995329 0.11119787501241411 2.24545239704408 0.8112545684559442 1.1560277290477903 0.7594856747567902 1.645345579829251 0.6017516091385081
select the table of :x and :z
julia> select(t, (:x, :z))Table with 10 rows, 2 columns: x z ───────── 1 false 2 false 3 false 4 false 5 true 6 false 7 true 8 false 9 false 10 falsejulia> select(t, r"(x|z)")Table with 10 rows, 2 columns: x z ───────── 1 false 2 false 3 false 4 false 5 true 6 false 7 true 8 false 9 false 10 false
map a function to the table of :x and :y
julia> select(t, (:x, :y) => row -> row[1] + row[2])10-element Vector{Float64}: 2.0708479992338487 1.0392723723377126 2.183350106800467 3.8888021249875857 7.24545239704408 5.1887454315440555 8.156027729047791 8.75948567475679 10.64534557982925 9.398248390861491julia> select(t, (1, :y) => row -> row.x + row.y)10-element Vector{Float64}: 2.0708479992338487 1.0392723723377126 2.183350106800467 3.8888021249875857 7.24545239704408 5.1887454315440555 8.156027729047791 8.75948567475679 10.64534557982925 9.398248390861491
select columns that are subtypes of Integer
julia> select(t, Integer)Table with 10 rows, 2 columns: x z ───────── 1 false 2 false 3 false 4 false 5 true 6 false 7 true 8 false 9 false 10 false
select columns that are not subtypes of Integer
julia> select(t, Not(Integer))Table with 10 rows, 1 columns: y ───────── 1.07085 -0.960728 -0.81665 -0.111198 2.24545 -0.811255 1.15603 0.759486 1.64535 -0.601752
Loading and Saving
Loading Data From CSV
Loading a CSV file (or multiple files) into one of JuliaDB's tabular data structures is accomplished via the loadtable and loadndsparse functions.
using JuliaDB, DelimitedFiles
x = rand(10, 2)
writedlm("temp.csv", x, ',')
t = loadtable("temp.csv")Table with 9 rows, 2 columns:
0.3454284142279518 0.33009330171710827
───────────────────────────────────────
0.675905 0.162582
0.220677 0.197045
0.967087 0.0651113
0.0240857 0.204306
0.59852 0.521566
0.946908 0.799856
0.711715 0.60849
0.0838515 0.659101
0.994414 0.549732loadtable and loadndsparse use Missing to represent missing values. To load a CSV that instead uses DataValue, see CSVFiles.jl. For more information on missing value representations, see Missing Values.
Converting From Other Data Structures
using JuliaDB, RDatasets
df = dataset("datasets", "iris") # load data as DataFrame
table(df) # Convert DataFrame to IndexedTableTable with 150 rows, 5 columns:
SepalLength SepalWidth PetalLength PetalWidth Species
─────────────────────────────────────────────────────────────
5.1 3.5 1.4 0.2 "setosa"
4.9 3.0 1.4 0.2 "setosa"
4.7 3.2 1.3 0.2 "setosa"
4.6 3.1 1.5 0.2 "setosa"
5.0 3.6 1.4 0.2 "setosa"
5.4 3.9 1.7 0.4 "setosa"
4.6 3.4 1.4 0.3 "setosa"
5.0 3.4 1.5 0.2 "setosa"
4.4 2.9 1.4 0.2 "setosa"
⋮
5.8 2.7 5.1 1.9 "virginica"
6.8 3.2 5.9 2.3 "virginica"
6.7 3.3 5.7 2.5 "virginica"
6.7 3.0 5.2 2.3 "virginica"
6.3 2.5 5.0 1.9 "virginica"
6.5 3.0 5.2 2.0 "virginica"
6.2 3.4 5.4 2.3 "virginica"
5.9 3.0 5.1 1.8 "virginica"Save Table into Binary Format
A table can be saved to disk (for fast, efficient reloading) via the save function.
Load Table from Binary Format
Tables that have been save-ed can be loaded efficiently via load.