API
JuliaDB.JuliaDB, IndexedTables.ColDict, IndexedTables.IndexedTable, IndexedTables.Keys, IndexedTables.NDSparse, IndexedTables.Perm, JuliaDB.DIndexedTable, JuliaDB.DNDSparse, JuliaDB.IndexSpace, JuliaDB.Interval, Base.Broadcast.broadcast, Base.collect, Base.convert, Base.filter, Base.getindex, Base.join, Base.keys, Base.length, Base.map, Base.map, Base.map, Base.merge, Base.pairs, Base.reduce, Base.reduce, Base.sort, Base.sort!, Base.values, Dagger.compute, Dagger.distribute, Dagger.distribute, Dagger.distribute, Dagger.load, Dagger.save, Dagger.save, IndexedTables.aggregate!, IndexedTables.arrayof, IndexedTables.asofjoin, IndexedTables.best_perm_estimate, IndexedTables.collect_columns, IndexedTables.colnames, IndexedTables.columns, IndexedTables.columns, IndexedTables.convertdim, IndexedTables.convertdim, IndexedTables.convertmissing, IndexedTables.dimlabels, IndexedTables.dropmissing, IndexedTables.excludecols, IndexedTables.flatten, IndexedTables.flush!, IndexedTables.groupby, IndexedTables.groupjoin, IndexedTables.groupreduce, IndexedTables.insertcols, IndexedTables.insertcolsafter, IndexedTables.insertcolsbefore, IndexedTables.leftjoin, IndexedTables.map_rows, IndexedTables.naturaljoin, IndexedTables.naturaljoin, IndexedTables.ncols, IndexedTables.ndsparse, IndexedTables.pkeynames, IndexedTables.pkeynames, IndexedTables.pkeys, IndexedTables.reducedim_vec, IndexedTables.reducedim_vec, IndexedTables.reindex, IndexedTables.rename, IndexedTables.rows, IndexedTables.select, IndexedTables.selectkeys, IndexedTables.selectvalues, IndexedTables.stack, IndexedTables.summarize, IndexedTables.table, IndexedTables.unstack, IndexedTables.update!, IndexedTables.where, JuliaDB.fromchunks, JuliaDB.loadndsparse, JuliaDB.loadtable, JuliaDB.mapchunks, JuliaDB.partitionplot, JuliaDB.rechunk, JuliaDB.tracktime, StatsBase.transform
JuliaDB.JuliaDB — Module
JuliaDB
JuliaDB is a package for working with large persistent data sets.
JuliaDB is an all-Julia, end-to-end tool that can
- Load multi-dimensional datasets quickly and incrementally
- Index the data and perform filter, aggregate, sort, and join operations
- Save results and load them efficiently later
- Use Julia's built-in parallelism to fully utilize any machine or cluster
Introduce yourself to JuliaDB's features at juliadb.org or jump into the documentation here!
JuliaDB.DIndexedTable — Type
A distributed table.
JuliaDB.DNDSparse — Type
DNDSparse{K,V} <: AbstractNDSparse
A distributed NDSparse data structure. Can be constructed by:
- ndsparse from Julia objects
- loadndsparse from data on disk
- distribute from an NDSparse object
JuliaDB.IndexSpace — Type
IndexSpace(interval, boundingrect, nrows)
Metadata about a chunk.
- interval: An Interval object with the first and the last index tuples.
- boundingrect: An Interval object with the lowest and the highest indices as tuples.
- nrows: A Nullable{Int} of number of rows in the NDSparse, if knowable.
JuliaDB.Interval — Type
An interval type tailored specifically to store intervals of indices of an NDSparse object. Some of the operations on this type, such as in or <, may be controversial for a generic Interval type.
Base.collect — Method
collect(t::DNDSparse)
Gathers the distributed data in a DNDSparse t and merges it into a single NDSparse object.
Base.getindex — Method
t[idx...]
Returns a DNDSparse containing only the elements of t where the given indices (idx) match. If idx has the same type as the index tuple of t, this is treated as scalar indexing (indexing of a single value): the value itself is looked up and returned.
Base.length — Method
The length of the DNDSparse, if it can be computed; throws an error otherwise. For tables whose length is not yet known, call compute on them first and then take the length.
Base.map — Method
map(f, t::DNDSparse)
Applies a function f to every element in the data of table t.
Dagger.compute — Method
compute(t::DNDSparse; allowoverlap, closed)
Computes any delayed evaluations in the DNDSparse. The computed data is left on the worker processes. Subsequent operations on the result will reuse the chunks.
If allowoverlap is false, the computed data is re-sorted if necessary so that no chunks have overlapping index ranges.
If closed is true, the computed data is re-sorted if necessary so that no chunks have overlapping OR continuous boundaries.
See also collect.
compute(t) requires at least as much memory as the size of the result of computing t. You usually don't need to do this for the whole dataset. If the result is expected to be big, try compute(save(t, "output_dir")) instead. See save for more.
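As a rough illustration of the delayed-evaluation workflow described above, the following sketch distributes a small NDSparse, applies map, and calls compute before collect. The data, the chunk count of 2, and the variable names are assumptions made for the example and are not taken from the JuliaDB documentation.

```julia
using JuliaDB

# A small in-memory NDSparse (illustrative data)
x = ndsparse([1, 2, 3, 4], [10, 20, 30, 40])

# Split it into 2 chunks, giving a DNDSparse
dx = distribute(x, 2)

# Operations on a DNDSparse may be evaluated lazily
doubled = map(v -> 2v, dx)

# compute materializes the result; the chunks stay on the worker processes
computed = compute(doubled)

# collect gathers the chunks into a single NDSparse on the calling process
result = collect(computed)
doubled_values = collect(values(result))
```

For a result that is too large to hold in memory, the compute(save(t, "output_dir")) pattern mentioned above is the safer choice.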
Dagger.distribute — Function
distribute(itable::NDSparse, nchunks::Int=nworkers())
Distributes an NDSparse object into a DNDSparse of nchunks chunks of approximately equal size.
Returns a DNDSparse.
Dagger.distribute — Method
distribute(t::Table, chunks)
Distributes a table into chunks pieces. Equivalent to table(t, chunks=chunks).
Dagger.distribute — Method
distribute(itable::NDSparse, rowgroups::AbstractArray)
Distributes an NDSparse object into a DNDSparse by splitting it up into chunks of rowgroups elements. rowgroups is a vector specifying the number of rows in each chunk.
Returns a DNDSparse.
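The two NDSparse forms of distribute can be compared side by side. This is a minimal sketch with made-up data; the chunk count and the row-group sizes [1, 2, 3] are assumptions chosen so that they add up to the 6 rows of the example.

```julia
using JuliaDB

# Illustrative data: 6 rows keyed by an integer index
x = ndsparse([1, 2, 3, 4, 5, 6], [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Two chunks of approximately equal size
d1 = distribute(x, 2)

# Explicit row groups: chunks holding 1, 2 and 3 rows respectively
d2 = distribute(x, [1, 2, 3])

# Either way, collect gathers the chunks back into one NDSparse
v1 = collect(values(collect(d1)))
v2 = collect(values(collect(d2)))
```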
Dagger.load — Method
load(dir::AbstractString)
Load a saved DNDSparse from the directory dir. Data can be saved using the save function.
Dagger.save — Method
save(t::Union{NDSparse, IndexedTable}, dest::AbstractString)
Save a dataset to disk as dest. Saved data can be loaded with load.
Dagger.save — Method
save(t::Union{DNDSparse, DIndexedTable}, destdir::AbstractString)
Saves a distributed dataset to disk in directory destdir. Saved data can be loaded with load.
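A save/load round trip for an in-memory table might look like the sketch below. The table contents and the destination file name (scores.jdb inside a temporary directory) are hypothetical and chosen only for illustration.

```julia
using JuliaDB

# Illustrative table with a primary key
t = table([1, 2, 3], [0.1, 0.2, 0.3]; names = [:id, :score], pkey = :id)

# Hypothetical destination inside a fresh temporary directory
dest = joinpath(mktempdir(), "scores.jdb")

save(t, dest)          # write the table to disk
restored = load(dest)  # read it back

restored_ids = collect(select(restored, :id))
restored_scores = collect(select(restored, :score))
```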
IndexedTables.convertdim — Method
convertdim(x::DNDSparse, d::DimName, xlate; agg::Function, name)
Apply function or dictionary xlate to each index in the specified dimension. If the mapping is many-to-one, agg is used to aggregate the results. name optionally specifies a name for the new dimension. xlate must be a monotonically increasing function.
See also reduce.
IndexedTables.leftjoin — Method
leftjoin(left::DNDSparse, right::DNDSparse, [op::Function])
Keeps only rows with indices in left. If rows of the same index are present in right, then they are combined using op. op by default picks the value from right.
IndexedTables.naturaljoin — Method
naturaljoin(op, left::DNDSparse, right::DNDSparse, ascolumns=false)
Returns a new DNDSparse containing only rows where the indices are present both in left AND right tables. The data columns are concatenated. The data of the matching rows from left and right are combined using op. If op returns a tuple or NamedTuple, and ascolumns is set to true, the output table will contain the tuple elements as separate data columns instead of as a single column of resultant tuples.
IndexedTables.naturaljoin — Method
naturaljoin(left::DNDSparse, right::DNDSparse, [op])
Returns a new DNDSparse containing only rows where the indices are present both in left AND right tables. The data columns are concatenated.
IndexedTables.reducedim_vec — Method
reducedim_vec(f::Function, t::DNDSparse, dims)
Like reducedim, except uses a function mapping a vector of values to a scalar instead of a 2-argument scalar function.
See also reducedim.
JuliaDB.fromchunks — Method
fromchunks(cs)
Construct a distributed object from chunks. Calls fromchunks(T, cs) where T is the type of the data in the first chunk. Computes any thunks.
JuliaDB.loadndsparse — Method
loadndsparse(files::Union{AbstractVector,String}; <options>)
Load an NDSparse from CSV files.
files is either a vector of file paths, or a directory name.
Options:
- indexcols::Vector – columns to use as indexed columns. (by default a 1:n implicit index is used.)
- datacols::Vector – non-indexed columns. (defaults to all columns but indexed columns). Specify this to only load a subset of columns. In place of the name of a column, you can specify a tuple of names – this will treat any column with one of those names as the same column, but use the first name in the tuple. This is useful when the same column changes name between CSV files. (e.g. vendor_id and VendorId)
All other options are identical to those in loadtable.
JuliaDB.loadtable — Method
loadtable(files::Union{AbstractVector,String}; <options>)
Load a table from CSV files.
files is either a vector of file paths, or a directory name.
Options:
- output::AbstractString – directory name to write the table to. By default data is loaded directly to memory. Specifying this option will allow you to load data larger than the available memory.
- indexcols::Vector – columns to use as primary key columns. (defaults to [])
- datacols::Vector – non-indexed columns. (defaults to all columns but indexed columns). Specify this to only load a subset of columns. In place of the name of a column, you can specify a tuple of names – this will treat any column with one of those names as the same column, but use the first name in the tuple. This is useful when the same column changes name between CSV files. (e.g. vendor_id and VendorId)
- distributed::Bool – should the output dataset be loaded as a distributed table? If true, this will use all available worker processes to load the data. (defaults to true if workers are available, false if not)
- chunks::Int – number of chunks to create when loading distributed. (defaults to number of workers)
- delim::Char – the delimiter character. (defaults to ,). Use spacedelim=true to split by spaces.
- spacedelim::Bool – parse space-delimited files. delim has no effect if true.
- quotechar::Char – quote character. (defaults to ")
- escapechar::Char – escape character. (defaults to ")
- filenamecol::Union{Symbol, Pair} – create a column containing the name of the file each row came from. This argument gives a name to the column. By default, basename(name) of the name is kept, and the ".csv" suffix is stripped. To provide a custom function to apply on the names, use a name => Function pair. By default, no file name column will be created.
- header_exists::Bool – does a header exist in the files? (defaults to true)
- colnames::Vector{String} – specify column names for the files; use this with header_exists=false, otherwise the first row is discarded. By default column names are assumed to be present in the file.
- samecols – a vector of tuples of strings where each tuple contains alternative names for the same column. For example, if some files have the name "vendorid" and others have the name "VendorID", pass samecols=[("VendorID", "vendorid")].
- colparsers – either a vector or dictionary of data types or an AbstractToken object from the TextParse package. By default, these are inferred automatically. See the type_detect_rows option below.
- type_detect_rows – number of rows to use to infer the initial colparsers. Defaults to 20.
- nastrings::Vector{String} – strings that are to be considered missing values. (defaults to TextParse.NA_STRINGS)
- skiplines_begin::Int – number of lines to skip at the beginning of each file. (doesn't skip by default)
- usecache::Bool – (vestigial)
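To make a few of these options concrete, here is a minimal sketch that writes a tiny CSV file and loads it with indexcols and distributed=false. The file name trips.csv, its two columns, and the values are hypothetical; only loadtable and the option names come from the description above.

```julia
using JuliaDB

# Write a tiny CSV file to a temporary directory (hypothetical data)
dir = mktempdir()
path = joinpath(dir, "trips.csv")
write(path, "vendor_id,fare\n2,7.5\n1,3.0\n")

# Load it as an in-memory, non-distributed table keyed on vendor_id
t = loadtable([path]; indexcols = ["vendor_id"], distributed = false)

# The table is sorted by its primary key, so rows come back in vendor_id order
key_names = pkeynames(t)
fares = collect(select(t, :fare))
```

Passing output = "some_dir" instead would write the loaded table to disk, which is the route to take for data larger than memory.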
JuliaDB.mapchunks — Method
mapchunks(f, t::DNDSparse; keeplengths=true)
Applies a function to each chunk in t. Returns a new DNDSparse. If keeplengths is false, the lengths of the output chunks are unknown before compute. This function is used internally by many DNDSparse operations.
JuliaDB.partitionplot — Function
partitionplot(table, y; stat=Extrema(), nparts=100, by=nothing, dropmissing=false)
partitionplot(table, x, y; stat=Extrema(), nparts=100, by=nothing, dropmissing=false)
Plot a summary of variable y against x (1:length(y) if not specified). Using nparts approximately-equal sections along the x-axis, the data in y over each section is summarized by stat.
JuliaDB.rechunk — Function
rechunk(t::Union{DNDSparse, DIndexedTable}[, by[, select]]; <options>)
Reindex and sort a distributed dataset by keys selected by by.
Optionally select specifies which non-indexed fields are kept. By default this is all fields not mentioned in by for Table and the value columns for NDSparse.
Options:
- chunks – how to distribute the data. This can be:
  - An integer – number of chunks to create
  - A vector of k integers – number of elements in each of the k chunks. sum(k) must be the same as length(t)
  - The distribution of another array, i.e. vec.subdomains where vec is a distributed array.
- merge::Function – a function which merges two sub-tables or sub-ndsparses into one NDSparse. They may have overlaps in their indices.
- splitters::AbstractVector – specify keys to split by. To create n chunks you would need to pass n-1 splitters and also the chunks=n option.
- chunks_sorted::Bool – are the chunks sorted locally? If true, this skips sorting or re-indexing them.
- affinities::Vector{<:Integer} – which processes (Int pid) should each output chunk be created on. If unspecified all workers are used.
- closed::Bool – if true, the same key will not be present in multiple chunks (although sorted). true by default.
- nsamples::Integer – number of keys to randomly sample from each chunk to estimate splitters in the sorting process. (See samplesort). Defaults to 2000.
- batchsize::Integer – how many chunks at a time from the input should be loaded into memory at any given time. This will essentially sort in batches of batchsize chunks.
JuliaDB.tracktime — Method
tracktime(f)
Track the time spent on different processes in different categories in running f.
IndexedTables.ColDict — Method
d = ColDict(t)
Create a mutable dictionary of columns in t.
To get the immutable iterator of the same type as t, call d[].
IndexedTables.IndexedTable — Type
IndexedTables.Keys — Type
Keys()
Select the primary keys.
Examples
t = table([1,1,2,2], [1,2,1,2], [1,2,3,4], names=[:a,:b,:c], pkey = (:a, :b))
select(t, Keys())
IndexedTables.NDSparse — Method
NDSparse(columns...; names=Symbol[...], kwargs...)
Construct an NDSparse array from columns. The last argument is the data column, and the rest are index columns. The names keyword argument optionally specifies names for the index columns (dimensions).
IndexedTables.Perm — Type
A permutation
Fields:
- columns: The columns being indexed as a vector of integers (column numbers)
- perm: the permutation - an array or iterator which has the sorted permutation
Base.Broadcast.broadcast — Method
broadcast(f, A::NDSparse, B::NDSparse; dimmap::Tuple{Vararg{Int}})
A .* B
Compute an inner join of A and B using function f, where the dimensions of B are a subset of the dimensions of A. Values from B are repeated over the extra dimensions.
dimmap optionally specifies how dimensions of A correspond to dimensions of B. It is a tuple where dimmap[i]==j means the ith dimension of A matches the jth dimension of B. Extra dimensions of A that do not match any dimension of B should have dimmap[i]==0.
If dimmap is not specified, it is determined automatically using index column names and types.
Example
a = ndsparse(([1,1,2,2], [1,2,1,2]), [1,2,3,4])
b = ndsparse([1,2], [1/1, 1/2])
broadcast(*, a, b)
dimmap maps dimensions that should be broadcasted:
broadcast(*, a, b, dimmap=(0,1))
Base.convert — Method
convert(IndexedTable, pkeys, vals; kwargs...)
Construct a table with pkeys as primary keys and vals as corresponding non-indexed items. Keyword arguments will be forwarded to the table constructor.
Example
convert(IndexedTable, Columns(x=[1,2],y=[3,4]), Columns(z=[1,2]), presorted=true)
Base.filter — Method
filter(f, t::Union{IndexedTable, NDSparse}; select)
Iterate over t and return the rows for which f(row) returns true. select determines which fields of each row are passed as arguments to f (see select).
f can also be a tuple of column => function pairs. Returned rows will be those for which all conditions are true.
Example
# filter iterates over ROWS of an IndexedTable
t = table(rand(100), rand(100), rand(100), names = [:x, :y, :z])
filter(r -> r.x + r.y + r.z < 1, t)
# filter iterates over VALUES of an NDSparse
x = ndsparse(1:100, randn(100))
filter(val -> val > 0, x)
Base.join — Method
join(left, right; kw...)
join(f, left, right; kw...)
Join tables left and right.
If a function f(leftrow, rightrow) is provided, the returned table will have a single output column. See the Examples below.
If the same key occurs multiple times in either table, each left row will get matched with each right row, resulting in n_occurrences_left * n_occurrences_right output rows.
Options (keyword arguments)
- how = :inner – Join method to use. Described below.
- lkey = pkeys(left) – Fields from left to match on (see pkeys).
- rkey = pkeys(right) – Fields from right to match on.
- lselect = Not(lkey) – Output columns from left (see Not).
- rselect = Not(rkey) – Output columns from right.
- missingtype = Missing – Type of missing values that can be created through :left and :outer joins. The other supported option is DataValue.
Join methods (how = :inner)
- :inner – rows with matching keys in both tables
- :left – all rows from left, plus matched rows from right (missing values can occur)
- :outer – all rows from both tables (missing values can occur)
- :anti – rows in left WITHOUT matching keys in right
Examples
a = table((x = 1:10, y = rand(10)), pkey = :x)
b = table((x = 1:2:20, z = rand(10)), pkey = :x)
join(a, b; how = :inner)
join(a, b; how = :left)
join(a, b; how = :outer)
join(a, b; how = :anti)
join((l, r) -> l.y + r.z, a, b)
Base.keys — Method
keys(x::NDSparse[, select::Selection])
Get the keys of an NDSparse object. Same as rows but acts only on the index columns of the NDSparse.
Base.map — Method
map(f, t::IndexedTable; select)
Apply f to every item in t selected by select (see also the select function). Returns a new table if f returns a tuple or named tuple. If not, returns a vector.
Examples
t = table([1,2], [3,4], names=[:x, :y])
polar = map(p -> (r = hypot(p.x, p.y), θ = atan(p.y, p.x)), t)
back2t = map(p -> (x = p.r * cos(p.θ), y = p.r * sin(p.θ)), polar)
Base.map — Method
map(f, x::NDSparse; select = values(x))
Apply f to every data value in x. select selects the fields passed to f (see select). By default, the data values are selected.
If the return value of f is a tuple or named tuple the result will contain many data columns.
Examples
x = ndsparse((t=[0.01, 0.05],), (x=[1,2], y=[3,4]))
polar = map(row -> (r = hypot(row.x, row.y), θ = atan(row.y, row.x)), x)
back2x = map(row -> (x = row.r * cos(row.θ), y = row.r * sin(row.θ)), polar)
Base.merge — Method
merge(a::IndexedTable, b::IndexedTable; pkey)
Merge rows of a with rows of b and remain ordered by the primary key(s). a and b must have the same column names.
merge(a::NDSparse, b::NDSparse; agg)
Merge rows of a with rows of b. To keep unique keys, the value from b takes priority. A provided function agg will aggregate values from a and b that have the same key(s).
Example:
a = table((x = 1:5, y = rand(5)); pkey = :x)
b = table((x = 6:10, y = rand(5)); pkey = :x)
merge(a, b)
a = ndsparse([1,3,5], [1,2,3])
b = ndsparse([2,3,4], [4,5,6])
merge(a, b)
merge(a, b; agg = (x,y) -> x)
Base.pairs — Method
pairs(arr::NDSparse, indices...)
Similar to where, but returns an iterator giving index=>value pairs. index will be a tuple.
Base.reduce — Method
reduce(f, t::IndexedTable; select::Selection)
Apply reducer function f pair-wise to the selection select in t. The reducer f can be:
- A function
- An OnlineStat
- A (named) tuple of functions and/or OnlineStats
- A (named) tuple of (selector => function) or (selector => OnlineStat) pairs
Examples
t = table(1:5, 6:10, names = [:t, :x])
reduce(+, t, select = :t)
reduce((a, b) -> (t = a.t + b.t, x = a.x + b.x), t)
using OnlineStats
reduce(Mean(), t, select = :t)
reduce((Mean(), Variance()), t, select = :t)
y = reduce((min, max), t, select=:x)
reduce((sum = +, prod = *), t, select=:x)
# combining reduction and selection
reduce((xsum = :x => +, negtsum = (:t => -) => +), t)
Base.reduce — Method
reduce(f, x::NDSparse; dims)
Drop the dims dimension(s) and aggregate values with f.
x = ndsparse((x=[1,1,1,2,2,2],
y=[1,2,2,1,2,2],
z=[1,1,2,1,1,2]), [1,2,3,4,5,6])
reduce(+, x; dims=1)
reduce(+, x; dims=(1,3))
Base.sort! — Method
sort!(t; kw...)
sort!(t, by; kw...)
Sort rows of t by by in place. All of Base.sort keyword arguments can be used.
Examples
t = table([1,1,1,2,2,2], [1,1,2,2,1,1], [1,2,3,4,5,6], names=[:x,:y,:z]);
sort!(t, :z, rev = true)
t
Base.sort — Method
sort(t; select, kw...)
sort(t, by; select, kw...)
Sort rows by by. All of Base.sort keyword arguments can be used.
Examples
t=table([1,1,1,2,2,2], [1,1,2,2,1,1], [1,2,3,4,5,6], names=[:x,:y,:z]);
sort(t, :z; select = (:y, :z), rev = true)
Base.values — Method
values(x::NDSparse[, select::Selection])
Get the values of an NDSparse object. Same as rows but acts only on the value columns of the NDSparse.
IndexedTables.aggregate! — Method
aggregate!(f::Function, arr::NDSparse)
Combine adjacent rows with equal indices using the given 2-argument reduction function, in place.
IndexedTables.arrayof — Method
arrayof(T)
Returns the type of Columns or Vector suitable to store values of type T. Nested tuples beget nested Columns.
IndexedTables.asofjoin — Method
asofjoin(left::NDSparse, right::NDSparse)
Join rows from left with the "most recent" value from right.
Example
using Dates
akey1 = ["A", "A", "B", "B"]
akey2 = [Date(2017,11,11), Date(2017,11,12), Date(2017,11,11), Date(2017,11,12)]
avals = collect(1:4)
bkey1 = ["A", "A", "B", "B"]
bkey2 = [Date(2017,11,12), Date(2017,11,13), Date(2017,11,10), Date(2017,11,13)]
bvals = collect(5:8)
a = ndsparse((akey1, akey2), avals)
b = ndsparse((bkey1, bkey2), bvals)
asofjoin(a, b)
IndexedTables.best_perm_estimate — Method
Returns: (n, perm) where n is the number of columns in the beginning of cols, perm is one possible permutation of those first n columns.
IndexedTables.collect_columns — Method
collect_columns(itr)
Collect an iterable as a Columns object if it iterates Tuples or NamedTuples, as a normal Array otherwise.
Examples
s = [(1,2), (3,4)]
collect_columns(s)
s2 = Iterators.filter(isodd, 1:8)
collect_columns(s2)
IndexedTables.colnames — Function
colnames(itr)
Returns the names of the "columns" in itr.
Examples:
colnames(1:3)
colnames(Columns([1,2,3], [3,4,5]))
colnames(table([1,2,3], [3,4,5]))
colnames(Columns(x=[1,2,3], y=[3,4,5]))
colnames(table([1,2,3], [3,4,5], names=[:x,:y]))
colnames(ndsparse(Columns(x=[1,2,3]), Columns(y=[3,4,5])))
colnames(ndsparse(Columns(x=[1,2,3]), [3,4,5]))
colnames(ndsparse(Columns([1,2,3], [4,5,6]), Columns(x=[6,7,8])))
colnames(ndsparse(Columns(x=[1,2,3]), Columns([3,4,5],[6,7,8])))
IndexedTables.columns — Function
columns(itr, select::Selection = All())
Select one or more columns from an iterable of rows as a tuple of vectors.
select specifies which columns to select. Refer to the select function for the available selection options and syntax.
itr can be NDSparse, Columns, AbstractVector, or their distributed counterparts.
Examples
t = table(1:2, 3:4; names = [:x, :y])
columns(t)
columns(t, :x)
columns(t, (:x,))
columns(t, (:y, :x => -))
IndexedTables.columns — Method
columns(itr, which)
Returns a vector or a tuple of vectors from the iterator.
IndexedTables.convertdim — Method
convertdim(x::NDSparse, d::DimName, xlate; agg::Function, vecagg::Function, name)
Apply function or dictionary xlate to each index in the specified dimension. If the mapping is many-to-one, agg or vecagg is used to aggregate the results. If agg is passed, it is used as a 2-argument reduction function over the data. If vecagg is passed, it is used as a vector-to-scalar function to aggregate the data. name optionally specifies a new name for the translated dimension.
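The many-to-one case with agg can be sketched as follows. The data and the translation function (integer division rounding up, via cld) are assumptions picked for illustration; the dimension is addressed by its position, 1.

```julia
using JuliaDB

# Values indexed by a single integer dimension (illustrative data)
x = ndsparse([1, 2, 3, 4], [10, 20, 30, 40])

# Map indices 1,2 -> 1 and 3,4 -> 2. The mapping is many-to-one,
# so agg = + sums the values that land on the same new index.
y = convertdim(x, 1, i -> cld(i, 2); agg = +)

summed = collect(values(y))
```

Passing vecagg instead of agg would hand each group of colliding values to a vector-to-scalar function such as sum.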
IndexedTables.convertmissing — Method
convertmissing(tbl, missingtype)
Convert the missing value representation in tbl to be of type missingtype.
Example
using IndexedTables, DataValues
t = table([1,2,missing], [1,missing,3])
IndexedTables.convertmissing(t, DataValue)
IndexedTables.dimlabels — Method
dimlabels(t::NDSparse)
Returns an array of integers or symbols giving the labels for the dimensions of t. ndims(t) == length(dimlabels(t)).
IndexedTables.dropmissing — Function
dropmissing(t)
dropmissing(t, select)
Drop rows of table t which contain missing values (either Missing or DataValue), optionally only using the columns in select. Column types will be converted to non-missing types. For example:
- Vector{Union{Int, Missing}} -> Vector{Int}
- DataValueArray{Int} -> Vector{Int}
Example
t = table([0.1,0.5,missing,0.7], [2,missing,4,5], [missing,6,missing,7], names=[:t,:x,:y])
dropmissing(t)
dropmissing(t, (:t, :x))
IndexedTables.excludecols — Method
excludecols(itr, cols) -> Tuple of Int
Names of all columns in itr except cols. itr can be any of IndexedTable, NDSparse, StructArrays.StructVector, or AbstractVector.
Examples
using IndexedTables: excludecols
t = table([2,1],[1,3],[4,5], names=[:x,:y,:z], pkey=(1,2))
excludecols(t, (:x,))
excludecols(t, (2,))
excludecols(t, pkeynames(t))
excludecols([1,2,3], (1,))
IndexedTables.flatten — Function
flatten(t::Table, col=length(columns(t)))
Flatten the col column, which may contain a vector of iterables, while repeating the other fields. If the column argument is not provided, it defaults to the last column.
Examples:
x = table([1,2], [[3,4], [5,6]], names=[:x, :y])
flatten(x, 2)
t1 = table([3,4],[5,6], names=[:a,:b])
t2 = table([7,8], [9,10], names=[:a,:b])
x = table([1,2], [t1, t2], names=[:x, :y]);
flatten(x, :y)
IndexedTables.flush! — Method
flush!(arr::NDSparse)
Commit queued assignment operations, by sorting and merging the internal temporary buffer.
IndexedTables.groupby — Function
groupby(f, t, by = pkeynames(t); select, flatten=false, usekey = false)
Apply f to the select-ed columns (see select) in groups defined by the unique values of by.
If f returns a vector, split it into multiple columns with flatten = true.
To retain the grouping key in the resulting group use usekey = true.
Examples
using Statistics
t=table([1,1,1,2,2,2], [1,1,2,2,1,1], [1,2,3,4,5,6], names=[:x,:y,:z])
groupby(mean, t, :x, select=:z)
groupby(identity, t, (:x, :y), select=:z)
groupby(mean, t, (:x, :y), select=:z)
groupby((mean, std, var), t, :y, select=:z)
groupby((q25=z->quantile(z, 0.25), q50=median, q75=z->quantile(z, 0.75)), t, :y, select=:z)
# apply different aggregation functions to different columns
groupby((ymean = :y => mean, zmean = :z => mean), t, :x)
# include the grouping key
groupby(t, by; usekey = true) do key, dd
# code using key as key (named tuple) and dd as data
end
IndexedTables.groupjoin — Method
groupjoin(left, right; kw...)
groupjoin(f, left, right; kw...)
Join left and right creating groups of values with matching keys.
For keyword argument options, see join.
Examples
l = table([1,1,1,2], [1,2,2,1], [1,2,3,4], names=[:a,:b,:c], pkey=(:a, :b))
r = table([0,1,1,2], [1,2,2,1], [1,2,3,4], names=[:a,:b,:d], pkey=(:a, :b))
groupjoin(l, r)
groupjoin(l, r; how = :left)
groupjoin(l, r; how = :outer)
groupjoin(l, r; how = :anti)
IndexedTables.groupreduce — Function
groupreduce(f, t, by = pkeynames(t); select)
Calculate a reduce operation f over table t on groups defined by the values in selection by. The result is put in a table keyed by the unique by values.
Examples
t = table([1,1,1,2,2,2], 1:6, names=[:x, :y])
groupreduce(+, t, :x; select = :y)
groupreduce((sum=+,), t, :x; select = :y) # change output column name to :sum
t2 = table([1,1,1,2,2,2], [1,1,2,2,3,3], 1:6, names = [:x, :y, :z])
groupreduce(+, t2, (:x, :y), select = :z)
# different reducers for different columns
groupreduce((sumy = :y => +, sumz = :z => +), t2, :x)
IndexedTables.insertcols — Method
insertcols(t, position::Integer, map::Pair...)
For each pair name => col in map, insert a column col named name starting at position. Returns a new table.
Example
t = table([0.01, 0.05], [2,1], [3,4], names=[:t, :x, :y], pkey=:t)
insertcols(t, 2, :w => [0,1])
IndexedTables.insertcolsafter — Method
insertcolsafter(t, after, map::Pair...)
For each pair name => col in map, insert a column col named name after after. Returns a new table.
Example
t = table([0.01, 0.05], [2,1], [3,4], names=[:t, :x, :y], pkey=:t)
insertcolsafter(t, :t, :w => [0,1])
IndexedTables.insertcolsbefore — Method
insertcolsbefore(t, before, map::Pair...)
For each pair name => col in map, insert a column col named name before before. Returns a new table.
Example
t = table([0.01, 0.05], [2,1], [3,4], names=[:t, :x, :y], pkey=:t)
insertcolsbefore(t, :x, :w => [0,1])
IndexedTables.map_rows — Method
map_rows(f, c...)
Transform collection c by applying f to each element. For multiple collection arguments, apply f elementwise. Collect output as Columns if f returns Tuples or NamedTuples with constant fields, as Array otherwise.
Examples
map_rows(i -> (exp = exp(i), log = log(i)), 1:5)
IndexedTables.ncols — Function
ncols(itr)
Returns the number of columns in itr.
Examples
ncols([1,2,3]) == 1
ncols(rows(([1,2,3],[4,5,6]))) == 2
IndexedTables.ndsparse — Function
ndsparse(keys, values; kw...)
Construct an NDSparse array with the given keys and values columns. On construction, the keys and data are sorted in lexicographic order of the keys.
Keyword Argument Options:
- agg = nothing – Function to aggregate values with duplicate keys.
- presorted = false – Are the key columns already sorted?
- copy = true – Should the columns in keys and values be copied?
- chunks = nothing – Provide an integer to distribute data into chunks chunks.
  - A good choice is nworkers() (after using Distributed)
  - See also: distribute
Examples:
x = ndsparse(["a","b"], [3,4])
keys(x)
values(x)
x["a"]
# Dimensions are named if constructed with a named tuple of columns
x = ndsparse((index = 1:10,), rand(10))
x[1]
# Multiple dimensions by passing a (named) tuple of columns
x = ndsparse((x = 1:10, y = 1:2:20), rand(10))
x[1, 1]
# Value columns can also have names via named tuples
x = ndsparse(1:10, (x=rand(10), y=rand(10)))
IndexedTables.pkeynames — Method
pkeynames(t::Table)
Names of the primary key columns in t.
Examples
t = table([1,2], [3,4]);
pkeynames(t)
t = table([1,2], [3,4], pkey=1);
pkeynames(t)
t = table([2,1],[1,3],[4,5], names=[:x,:y,:z], pkey=(1,2));
pkeynames(t)
IndexedTables.pkeynames — Method
pkeynames(t::NDSparse)
Names of the primary key columns in t.
Example
x = ndsparse([1,2],[3,4])
pkeynames(x)
x = ndsparse((x=1:10, y=1:2:20), rand(10))
pkeynames(x)
IndexedTables.pkeys — Method
pkeys(itr::IndexedTable)
Primary keys of the table. If the table doesn't have any designated primary key columns (constructed without the pkey argument), then a default key of tuples (1,):(n,) is generated.
Example
a = table(["a","b"], [3,4]) # no pkey
pkeys(a)
a = table(["a","b"], [3,4], pkey=1)
pkeys(a)
IndexedTables.reducedim_vec — Method
reducedim_vec(f::Function, arr::NDSparse, dims)
Like reduce, except uses a function mapping a vector of values to a scalar instead of a 2-argument scalar function.
IndexedTables.reindex — Function
reindex(t::IndexedTable, by)
reindex(t::IndexedTable, by, select)
Reindex table t with new primary key by, optionally keeping a subset of columns via select. For NDSparse, use selectkeys.
Example
t = table([2,1],[1,3],[4,5], names=[:x,:y,:z], pkey=(1,2))
t2 = reindex(t, (:y, :z))
pkeynames(t2)
IndexedTables.rename — Method
rename(t, map::Pair...)
For each pair col => newname in map, set newname as the new name for column col in t. Returns a new table.
Example
t = table([0.01, 0.05], [2,1], names=[:t, :x])
rename(t, :t => :time)
IndexedTables.rows — Function
rows(itr, select = All())
Select one or more fields from an iterable of rows as a vector of their values. Refer to the select function for selection options and syntax.
itr can be NDSparse, StructArrays.StructVector, AbstractVector, or their distributed counterparts.
Examples
t = table([1,2],[3,4], names=[:x,:y])
rows(t)
rows(t, :x)
rows(t, (:x,))
rows(t, (:y, :x => -))
IndexedTables.select — Method
select(t::Table, which::Selection)
Select all or a subset of columns, or a single column from the table.
Selection is a type union of many types that can select from a table. It can be:
- Integer – returns the column at this position.
- Symbol – returns the column with this name.
- Pair{Selection => Function} – selects and maps a function over the selection, returns the result.
- AbstractArray – returns the array itself. This must be the same length as the table.
- Tuple of Selection – returns a table containing a column for every selector in the tuple.
- Regex – returns the columns with names that match the regular expression.
- Type – returns columns with elements of the given type.
Examples:
t = table(1:10, randn(10), rand(Bool, 10); names = [:x, :y, :z])
# select the :x vector
select(t, 1)
select(t, :x)
# map a function to the :y vector
select(t, 2 => abs)
select(t, :y => x -> x > 0 ? x : -x)
# select the table of :x and :z
select(t, (:x, :z))
select(t, r"(x|z)")
# map a function to the table of :x and :y
select(t, (:x, :y) => row -> row[1] + row[2])
select(t, (1, :y) => row -> row.x + row.y)
IndexedTables.selectkeys — Method
selectkeys(x::NDSparse, sel)
Return an NDSparse with a subset of keys.
IndexedTables.selectvalues — Method
selectvalues(x::NDSparse, sel)
Return an NDSparse with a subset of values.
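Neither selectkeys nor selectvalues comes with an example, so the following is a small sketch rather than an official one. It assumes `using IndexedTables` provides both functions and that `sel` accepts the same selectors as `rows`; the data and the names `x`, `ka`, and `vp` are illustrative. The data is chosen so that key column `a` alone is still unique, because the source does not say how duplicate keys are handled.

```julia
using IndexedTables

# Two key columns (a, b) and two named value columns (p, q).
x = ndsparse((a = [1, 2], b = [3, 4]), (p = [10, 20], q = [0.1, 0.2]))

# Keep only key column a; both value columns are carried over.
ka = selectkeys(x, (:a,))

# Keep both key columns but only value column p.
vp = selectvalues(x, :p)
```

After these calls, `ka` is indexed by a single key (`ka[1]`), while `vp` is still indexed by both keys (`vp[1, 3]`).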
IndexedTables.stack — Method
stack(t, by = pkeynames(t); select = Not(by), variable = :variable, value = :value)
Reshape a table from the wide to the long format. Columns in by are kept as indexing columns. Columns in select are stacked. In addition to the id columns, two additional columns labeled variable and value are added, containing the column identifier and the stacked columns. See also unstack.
Examples
t = table(1:4, names = [:x], pkey=:x)
t = transform(t, :xsquare => :x => x -> x^2)
t = transform(t, :xcube => :x => x -> x^3)
stack(t)
IndexedTables.summarize — Function
summarize(f, t, by = pkeynames(t); select = Not(by), stack = false, variable = :variable)
Apply summary functions column-wise to a table. Return a NamedTuple in the non-grouped case and a table in the grouped case. Use stack=true to stack results of the same summary function for different columns.
Examples
using Statistics
t = table([1, 2, 3], [1, 1, 1], names = [:x, :y])
summarize((mean, std), t)
summarize((m = mean, s = std), t)
summarize(mean, t; stack=true)
summarize((mean, std), t; select = :y)
IndexedTables.table — Function
table(cols; kw...)
Create a table from a (named) tuple of AbstractVectors.
table(cols::AbstractVector...; names::Vector{Symbol}, kw...)
Create a table from the provided cols, optionally with names.
table(cols::Columns; kw...)
Construct a table from a vector of tuples. See rows and Columns.
table(t::Union{IndexedTable, NDSparse}; kw...)
Copy a Table or NDSparse to create a new table. The same primary keys as the input are used.
table(x; kw...)
Create an IndexedTable from any object x that follows the Tables.jl interface.
Keyword Argument Options:
- pkey: select columns to sort by and be the primary key.
- presorted = false: is the data pre-sorted by primary key columns?
- copy = true: creates a copy of the input vectors if true. Irrelevant if chunks is specified.
- chunks::Integer: distribute the table. Options are:
  - Int – (number of chunks) a safe bet is nworkers() after using Distributed.
  - Vector{Int} – Number of elements in each of the length(chunks) chunks.
Examples:
table(rand(10), rand(10), names = [:x, :y], pkey = :x)
table(rand(Bool, 20), rand(20), rand(20), pkey = [1,2])
table((x = 1:10, y = randn(10)))
table([(1,2), (3,4)])
IndexedTables.unstack — Method
unstack(t, by = pkeynames(t); variable = :variable, value = :value)
Reshape a table from the long to the wide format. Columns in by are kept as indexing columns. Keyword arguments variable and value denote which column contains the column identifier and which the corresponding values. See also stack.
Examples
t = table(1:4, [1, 4, 9, 16], [1, 8, 27, 64], names = [:x, :xsquare, :xcube], pkey = :x);
long = stack(t)
unstack(long)
IndexedTables.update! — Method
update!(f::Function, arr::NDSparse, indices...)
Replace data values x with f(x) at each location that matches the given indices.
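As a rough illustration of the in-place update described above, the sketch below doubles a subset of values. It assumes `using IndexedTables` exports `update!` and that one index argument is supplied per dimension (a scalar or a range here); the data and the name `x` are illustrative.

```julia
using IndexedTables

# Keys are (i, j) pairs; values are plain integers.
x = ndsparse(([1, 1, 2], [1, 2, 1]), [10, 20, 30])

# Double every value whose first index is 1 and whose second index
# lies in 1:2. The entry at (2, 1) does not match and is left alone.
update!(v -> 2v, x, 1, 1:2)
```

Because the function mutates `x`, looking up `x[1, 1]` afterwards returns the updated value.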
IndexedTables.where — Method
where(arr::NDSparse, indices...)
Returns an iterator over data items where the given indices match. Accepts the same index arguments as getindex.
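A minimal sketch of the iterator described above, assuming one index argument per dimension. The call is written as `IndexedTables.where` only to keep it visually distinct from Julia's infix `where` keyword; the data and the names `x` and `matches` are illustrative.

```julia
using IndexedTables

# Keys are (i, j) pairs; values are plain integers.
x = ndsparse(([1, 1, 2], [1, 2, 1]), [10, 20, 30])

# Iterate over the values whose first index is 1 and whose second
# index lies in 1:2, then materialize the iterator into a vector.
matches = collect(IndexedTables.where(x, 1, 1:2))
```

Collecting is optional; the iterator can also be consumed directly in a `for` loop or passed to a reduction such as `sum`.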
StatsBase.transform — Method
transform(t::Table, changes::Pair...)
Transform columns of t. For each pair col => value in changes, the column col is replaced by the AbstractVector value. If col is not an existing column, a new column is created.
Examples:
t = table([1,2], [3,4], names=[:x, :y])
# change second column to [5,6]
transform(t, 2 => [5,6])
transform(t, :y => :y => x -> x + 2)
# add [5,6] as column :z
transform(t, :z => 5:6)
transform(t, :z => :y => x -> x + 2)
# replacing the primary key results in a re-sorted copy
t = table([0.01, 0.05], [1,2], [3,4], names=[:t, :x, :y], pkey=:t)
t2 = transform(t, :t => [0.1,0.05])
# the column :z is not part of t so a new column is added
t = table([0.01, 0.05], [2,1], [3,4], names=[:t, :x, :y], pkey=:t)
transform(t, :z => [1//2, 3//4])