August 4, 2016
Version 2016.08.04.0 Released

MLDB is the Machine Learning Database. It’s the best way to get machine learning or AI into your applications or personal projects. Head on over to MLDB.ai to try it right now or see Running MLDB for installation details.

We’re happy to announce the immediate availability of MLDB version 2016.08.04.0.

This release contains 161 new commits, modifies 290 files, and fixes 82 issues. On top of many bug fixes and performance improvements, here are some of the highlights of this release:

New DISTINCT ON clause

The DISTINCT ON clause can be used to filter out duplicate rows based on the value of an expression. The syntax is as follows:

SELECT DISTINCT ON (algorithm, project) algorithm, project, date
FROM ml_experiments
ORDER BY algorithm, project

This will return one row for each unique combination of values in the algorithm and project columns.
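
To make the semantics concrete, here is a minimal plain-Python sketch of what DISTINCT ON does conceptually. It is not MLDB code: the sample rows and the distinct_on helper are made up for illustration, and it assumes that the first row of each group in ORDER BY order is the one kept.

```python
# Plain-Python illustration of DISTINCT ON semantics (not MLDB code).
# The rows and their values are made up for this example.
rows = [
    {"algorithm": "bbdt", "project": "churn", "date": "2016-07-01"},
    {"algorithm": "bbdt", "project": "churn", "date": "2016-07-15"},
    {"algorithm": "glz", "project": "churn", "date": "2016-07-02"},
    {"algorithm": "glz", "project": "fraud", "date": "2016-07-03"},
]


def distinct_on(rows, keys):
    """Keep the first row seen for each distinct combination of `keys`."""
    seen = set()
    result = []
    for row in rows:
        key = tuple(row[k] for k in keys)
        if key not in seen:
            seen.add(key)
            result.append(row)
    return result


# Sort first, mirroring ORDER BY algorithm, project.
ordered = sorted(rows, key=lambda r: (r["algorithm"], r["project"]))
unique_rows = distinct_on(ordered, ["algorithm", "project"])

for row in unique_rows:
    print(row["algorithm"], row["project"], row["date"])
```

The four input rows collapse to three, because two of them share the same (algorithm, project) pair.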

See the Select Expression documentation for more details.

New try builtin function

When an error occurs while processing a query, the whole query fails and no result is returned, even if only a single row caused the error. The new try function is meant to handle this type of situation. Its first argument is the expression to evaluate. The optional second argument is the value returned if an error is encountered.

In the example below, since the string foo will not parse as valid JSON, the row expression {'error': 1} will be returned instead:

SELECT try(parse_json('foo'), {'error': 1}) AS *
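
For readers more familiar with Python, the same fallback pattern can be sketched as follows. This is only an analogy, not MLDB's implementation: the try_ helper is hypothetical, and returning None when no fallback is given is a placeholder, since what MLDB returns when the second argument is omitted is covered in the try function documentation rather than here.

```python
import json


def try_(expression, fallback=None):
    """Evaluate `expression` (a zero-argument callable).

    If it raises, return `fallback` instead of propagating the error.
    The None default is a placeholder for this sketch only.
    """
    try:
        return expression()
    except Exception:
        return fallback


# 'foo' is not valid JSON, so the fallback value is returned.
result_bad = try_(lambda: json.loads("foo"), {"error": 1})

# Valid JSON parses normally and the fallback is ignored.
result_ok = try_(lambda: json.loads('{"a": 1}'), {"error": 1})

print(result_bad)  # {'error': 1}
print(result_ok)   # {'a': 1}
```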

Check out the try function documentation for more details.

Deep learning

Added support for NVIDIA cuDNN, improving the performance of MLDB’s TensorFlow integration on GPUs. This is another step toward making MLDB the easiest platform for running TensorFlow graphs.

Updated pymldb to version 0.7.0

The pymldb library is an open-source, pure-Python wrapper that makes it easy to work with MLDB from Python. Version 0.7.0 adds support for passing a JSON payload in GET requests. This is necessary when passing large feature vectors to MLDB functions.

Check out the Using pymldb Tutorial notebook for more info.

Internal hashing is now done using HighwayHash

MLDB’s hash functions now use the Highway Tree Hash, which is claimed to be both likely secure and very fast. This will improve the speed of working with large numbers of columns.

Other changes and fixes

  • New aggregators: vertical_stddev (alias of stddev) and vertical_variance (alias of variance).
  • The classifier.experiment procedure now returns the ID of the scorer function it creates for each fold. This makes it easier to reuse the functions in later steps of a script.
  • The runOnCreation argument, present for all procedures, now defaults to True, which was the value used by the vast majority of users.
  • The export.csv procedure has a new skipDuplicateCells argument which, when set to True, will skip rows that contain cells with multiple values. This is necessary because the CSV format cannot represent multiple values per cell the way MLDB datasets can by using the time dimension. More information is available in the export.csv procedure’s documentation.
  • Fixed a loss of precision for floats when using MLDB’s Python layer.
  • The arguments of the tokenize function, the import.text procedure and the tokensplit function are now all in camelCase.
  • The tabular dataset is more efficient in storing numbers and timestamps, leading to a reduction in memory usage.
  • The speed at which the sparse.mutable dataset can record rows has been improved.
  • Fixed a memory leak in the levenshtein_distance() built-in function.
  • Fixed NULL propagation for math operators. Example: 5+NULL = NULL.
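
As an illustration of the NULL propagation rule in the last item, here is a minimal Python sketch in which None stands in for SQL NULL. The add helper is hypothetical and only mirrors the rule that an arithmetic operation with a NULL operand yields NULL; it does not reflect how MLDB implements its operators.

```python
def add(a, b):
    """Add two values, returning None (standing in for NULL) if either is None."""
    if a is None or b is None:
        return None
    return a + b


print(add(5, 3))     # 8
print(add(5, None))  # None
```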