Posted about 13 years ago by rengolin
See the Introduction for this article here.
This part of the tutorial refers to the commit below:
Git commit: DATASET (N, transform(COUNTER))
https://github.com/hpcc-systems/HPCC-Platform/pull/1285/files
This step has many sub-steps, but since we must keep consistency, we can't just add features in one place (say, the parser) without adding them to the rest. So, this pull request changes every place needed to add a new syntax, although a syntax that doesn't add new activities (so, no changes in the engines).
Feel free to have a peek at the code now and realise that there are not many changes to make. The problem, however, is knowing where and how to change it. In hindsight, it always looks simple.
The Parser
Finding the Place
The parser file, like any Yacc file, is very hard to read, hard to understand the connections between tokens in, and even harder to change without breaking. Anyone who has ever tried to change a Yacc file will tell you the same, so don't worry if your first 10 attempts to change it fail miserably.
The first step is to add it to the Parser, so new ECL can be transformed into an expression tree inside the compiler. To do so, we must open the 'hqlgram.y' file and look for similar syntax.
Since ECL syntax, like most languages, uses parentheses to aggregate arguments, you have to look for the "DATASET '('" pattern to see where the other DATASET constructs are declared. Remember, we're trying to find similar syntax so we can reuse or copy code from other constructs for our own, so the closer we get to what we want, the better.
The first matches (from the top of the file) won't give you the right place. Since DATASET() has many uses, we need to skim past the false positives. For example:
DATASET '(' recordDef ')' only declares a potential dataset from a record
DATASET '(' dataSet ')' same, but using another dataset's record
VIRTUAL DATASET '(' recordDef ')' nothing in there...
DATASET '(' recordDef childDatasetOptions ')' only adding a field...
Then, around line (8170), you get to:
DATASET '(' thorFilenameOrList ',' dsRecordDef ...
That's big, ugly and doing a lot of things; let's see what it is.
thorFilenameOrList is an expression, that's promising...
Initially, an 'expression' was used, but 'thorFilenameOrList' (maybe not the best name) can cope with other features (for example, initialisation with lists of rows, file names and expressions), which removes the ambiguity from the parser.
So that's a good place to start looking at the code inside and see what it does. If you look at each implementation, you'll see how they manipulate data and how they convert the ECL code into the abstract syntax tree (AST). Example:
origin = $3.getExpr();
attrs = createComma(createAttribute(_origin_Atom, origin), $8.getExpr());
The third token's expression goes into 'origin', and 'attrs' is a "comma" (ie. a list of expressions) containing an "origin" Atom that references 'origin', plus the 8th token in the list.
Atoms are identifiers, or known properties. That code is saying that it knows a thing called "origin" and the associated expression is 'origin'. This is used by the tree builder and optimiser later on to identify features, filter and common up expressions based on those attributes.
Adding a new DATASET
Now it's time to think about what we need to do. We need to create a dataset from a transform, using a counter. Since we've got a counter to handle, we need to think about the consequences of having one. Go to the language reference and look for other constructs that have counters, and you'll find many, for example NORMALIZE. Searching for "NORMALIZE '('" in the parser file, you'll find a curious pair of tokens:
beginCounterScope / endCounterScope
Scopes of counters are necessary when you have nested transforms and each one has its own counter. You can't mix them, or you'll get wrong results (ie. an internal counter being updated by an external increment). You'll also see that the "counter" object is bound to the end of scope token.
A counter is not a number, but a reference to a number source. In our case, the "count" is the number source, which can be anything from a constant to an expression to a reference to an external object. Following the NORMALIZE case, you get to something like this:
DATASET '(' thorFilenameOrList ',' beginCounterScope transform endCounterScope
That represents DATASET(expr, TRANSFORM(COUNTER)), which can be implemented as:
IHqlExpression * counter = $7.getExpr(); // endCounterScope
if (counter)
    counter = createAttribute(_countProject_Atom, counter);
As you have seen in the NORMALIZE case, you must mark it as a "counter" by wrapping it in an attribute with a count Atom. The name "countProject" is maybe not the best, though.
$$.setExpr(
    createDataset(
        no_dataset_from_transform,
        $3.getExpr(),
        createComma(
            $6.getExpr(),
            counter)));
With this you create a dataset node for the operator "no_dataset_from_transform" (which needs to be added to the list of operators, as you'll see later), with the count source expression ($3) and a "comma" holding the transform ($6) and the counter. The comma is necessary because createDataset() accepts a list of operands.
If you follow the NORMALIZE and other DATASET examples, you'll see that you also need to normalise the numeric input, to make sure it's a valid integer:
parser->normalizeExpression($3, type_int, false);
and update the parser's context position to the current expression:
$$.setPosition($1);
However, just adding that code created ambiguities with the other DATASET constructs, because the counter object conflicts with other declarations. One way to fix it (Yacc style) was to add the counter object to all conflicting constructs and emit an error message there. In effect, we're turning the fact that counters are not meaningful on those other constructs from a syntax error into a semantic error. Check the changes to 'hqlgram.y' in the pull request for more information on that.
The AST
Now that we have a node in our AST being built by the parser, we need to support it. The first thing to do is add it to the node operator enum, in 'hqlexpr.hpp'. Find the first 'unusedN' item and replace it with your own:
no_some_other,
no_dataset_from_transform,
unused10,
Unused items were once real nodes that got removed from the language. To keep backward compatibility with old compiled code, we keep the same order and id of most of our enums. Adding a new node at the end would also work, but would leave a huge trail of unused ones in the middle.
You can now try to compile your code and execute the compiler regression[1] to make sure you haven't introduced any new bug. If all is green, create an example of what you expect to see in ECL:
r := record
    unsigned i;
end;
r t(unsigned value) := transform
    SELF.i := value;
end;
ds := DATASET(10, t(COUNTER));
OUTPUT(ds);
This should create a dataset with items from 1 to 10. If you run this example through your code, you'll see that it fails in multiple places. That's because the compiler has no knowledge of what to do with your node. One way to proceed is to keep running that code through your compiler and fixing the failures (ie. adding no_dataset_from_transform cases to switches, with appropriate code).
What is appropriate code? Well, that varies. If you look at the pull request, there are many places where the new node was added, and each had a different case. Examples:
@ bool definesColumnList(IHqlExpression * dataset)
case no_dataset_from_transform: return true;
@ const char *getOpString(node_operator op)
case no_dataset_from_transform: return "DATASET";
@ childDatasetType getChildDatasetType(IHqlExpression * expr)
case no_dataset_from_transform: return childdataset_none;
How do you know these things? Normally, it's either obvious or very hard to answer. Putting in the wrong value, in this case, won't do much harm to the other constructs (since we're restricting it to no_dataset_from_transform), but it may give you a false sense of success. It might work for the cases you have predicted, but it might fail for others, or simply be wrong.
Our suggestion is to use a mix of trying for yourself and asking on the list, but rest assured that, if you ever get the right results with bad code, we should be able to spot it in pull request reviews. Adding comments to the code to make that task easier is a work in progress.
One quick example of ECL's AST folding (more to come in the next step) is folding null datasets. In the pull request above, check the file 'hqlfold.cpp'. It checks for a zero-sized count (via 'isZero', which checks whether the expression evaluates to zero, not whether the value is a constant zero) and replaces the dataset with a null expression (via 'replaceWithNull').
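To make the shape of that fold concrete, here is a small self-contained C++ sketch over a toy expression tree. It is not the actual 'hqlfold.cpp' code and the type and function names are made up for illustration; it only shows the pattern: when the count operand folds to a constant zero, the whole dataset node is replaced by a null dataset node.
#include <iostream>
#include <memory>

// Toy expression node: a constant integer, a "dataset from transform" with a
// count child, or a null dataset. Purely illustrative, not HPCC's IHqlExpression.
struct Expr {
    enum Kind { Constant, DatasetFromTransform, NullDataset } kind;
    long value = 0;                      // used when kind == Constant
    std::shared_ptr<Expr> count;         // used when kind == DatasetFromTransform
};

static std::shared_ptr<Expr> makeConstant(long v)
{
    auto e = std::make_shared<Expr>(); e->kind = Expr::Constant; e->value = v; return e;
}

static std::shared_ptr<Expr> makeDatasetFromTransform(std::shared_ptr<Expr> count)
{
    auto e = std::make_shared<Expr>(); e->kind = Expr::DatasetFromTransform; e->count = count; return e;
}

// The folder: if the count operand evaluates to zero, fold the dataset to null.
static std::shared_ptr<Expr> foldDataset(const std::shared_ptr<Expr> & expr)
{
    if (expr->kind == Expr::DatasetFromTransform &&
        expr->count->kind == Expr::Constant && expr->count->value == 0)
    {
        auto null = std::make_shared<Expr>();
        null->kind = Expr::NullDataset;  // plays the role of replaceWithNull()
        return null;
    }
    return expr;                         // nothing to fold
}

int main()
{
    auto ds = makeDatasetFromTransform(makeConstant(0));
    auto folded = foldDataset(ds);
    std::cout << (folded->kind == Expr::NullDataset ? "folded to a null dataset\n"
                                                    : "left unchanged\n");
    return 0;
}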
The Activity
The AST will be translated into activities. These activities are generated as C++ by the compiler's back-end, collected into a driver that executes the graphs in order, and compiled again into a shared object. Each shared object is a workunit, which is executed by the HPCC engines (Roxie, Thor).
Each activity has a helper, and that's the class you'll have to implement from the AST node, so the activities in the engines can use it to execute your code. You need to find out what existing activity (if there is one) maps to the same functionality as your node. In our case, we're very lucky that there is one activity perfect for this job, but the question is, how to find it?
All activities implement the IHThorActivity interface, so you can consult the 'ecl/hthor/hthor.hpp' file and list all the classes deriving from that interface (aka. pure virtual struct). I assume your IDE does that for you; if not, 'hthor.cpp' and 'hthor.ipp' will give you plenty of material to read.
If you look thoroughly enough, or ask an experienced developer, you'll find out that 'CHThorTempTableActivity' is the activity you're looking for, the reason being that this activity builds a new dataset from scratch (ie. it's a source) and does so in a simple way. This activity is what the compiler already generates for the syntax:
DATASET(my-set, { single-fielded-record }).
If there is no activity that could possibly be mapped to this new AST node, you will have to create one and make all engines implement it. Choosing which one to derive from follows the same logic described above, though.
Now that we know what activity we're aiming towards, we'll try to export our AST node to it, so it'll automatically be executed by all engines. To do so, you must add a new 'BuildActivity' in the ECL-to-CPP translator. The translator is implemented by the class 'HqlCppTranslator' in 'ecl/hqlcpp/hqlhtcpp.cpp'.
You need to add a hook in 'buildActivity' to call your function when the case is a no_dataset_from_transform. Following the convention in the rest of the class, we called it 'doBuildActivityCountTransform'. Your new function will probably accept the context and the expression node.
Building an Exporter
An exporter function will generate a helper class that derives from a base class implementing the interface named above. Our activity's helper is an 'IHThorTempTableArg', which is partially implemented by 'CThorTempTableArg'.
Your class will have to derive from 'CThorTempTableArg' to re-use its generic methods and to get the link-counted logic that all these classes have. The main methods that 'CThorTempTableArg' hasn't implemented from the interface are:
size32_t getRow(ARowBuilder & rowBuilder, unsigned row);
unsigned numRows();
'numRows' will return the number of rows, so the activity can stop at the right time or allocate the right amount of memory beforehand. 'getRow' will return the next row in line, and in our case, update the internal counter.
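To make that contract clearer, here is a small self-contained C++ sketch of a hand-written helper following the same numRows/getRow shape. It is not the code the exporter generates, and the row builder here is a stand-in rather than HPCC's ARowBuilder; it only illustrates how the activity drives the two methods, with the counter being 'row + 1' because ECL counters start at one.
#include <cstdint>
#include <iostream>
#include <vector>

// Stand-in for the engine-side row builder; the real helper receives an ARowBuilder.
struct ToyRowBuilder {
    std::vector<uint64_t> rows;
};

// The numRows/getRow contract described above: numRows() bounds the loop,
// getRow() fills a single row for a given row index.
struct ToyDatasetFromTransformHelper {
    unsigned countValue;
    explicit ToyDatasetFromTransformHelper(unsigned count) : countValue(count) {}

    unsigned numRows() const { return countValue; }

    // Mimics "SELF.i := COUNTER": ECL counters are 1-based, so use row + 1.
    size_t getRow(ToyRowBuilder & builder, unsigned row) const {
        builder.rows.push_back(row + 1);
        return sizeof(uint64_t);           // size of the data written for this row
    }
};

int main()
{
    // Roughly what the activity does with the helper: ask for numRows() rows,
    // building each one in turn.
    ToyDatasetFromTransformHelper helper(10);
    ToyRowBuilder output;
    for (unsigned row = 0; row < helper.numRows(); ++row)
        helper.getRow(output, row);
    for (uint64_t v : output.rows)
        std::cout << v << '\n';            // prints 1..10, like DATASET(10, t(COUNTER))
    return 0;
}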
So, first we need to create a "TempTable" activity instance:
Owned<ActivityInstance> instance =
    new ActivityInstance(*this, ctx, TAKtemptable, expr, "TempTable");
buildActivityFramework(instance);
buildInstancePrefix(instance);
and override the functions we need. See that we're passing the context (the place in the resulting C++ file we're writing to) to the activity, so our instance has its own context. From now on, we have to re-use the activity's context to write any code in, otherwise it would be written outside the class and we would get a compilation error.
A simple function to start with is the check for whether the activity is constant. Normally it is, for temp tables, so the default behaviour is true. But in our case, the transform might not be constant (it may depend on external factors which we can't predict), so we need to change it accordingly:
// bool isConstant() - default is true
if (!isConstantTransform(transform))
    doBuildBoolFunction(instance->startctx, "isConstant", false);
See that we're using the 'doBuildBoolFunction' wrapper, which creates a new function returning a boolean from the result of the given expression, in our case "false".
This means we only override the function on those activities that really need it, and only when we know the value at compile time. If we didn't know it, we would have to pass an expression to be evaluated and override that function every time. Of course, compile-time evaluation is always better, so in this case we use it.
Overriding 'numRows' is also simple, just by passing the count (whether an expression or a constant), and the wrapper will take care of the rest:
// unsigned numRows() - count is guaranteed by lexer
doBuildUnsignedFunction(instance->startctx, "numRows", count);
However, 'getRow' is somewhat more complicated. We have to build an expression to account for the counter's increment, but also make sure we keep it attached to the counter object (that, if you remember, is just a reference to a number), so the transform can use it. We have to do the function building ourselves.
Start a new context (inside the activity's context):
BuildCtx funcctx(instance->startctx);
Add a function declaration (this could be more automated):
funcctx.addQuotedCompound("virtual size32_t getRow(ARowBuilder & crSelf, unsigned row)");
ensureRowAllocated(funcctx, "crSelf");
And bind the cursor's selector to the counter object ('row' is the current id):
BoundRow * selfCursor = bindSelf(funcctx, instance->dataset, "crSelf");
IHqlExpression * self = selfCursor->querySelector();
associateCounter(funcctx, counter, "row");
And finally, build a transform's body using the bound counter (self):
buildTransformBody(funcctx, transform, NULL, NULL, instance->dataset, self);
With this, your class will be exported as overriding CHThorTempTableActivity's helper (CThorTempTableArg), so whenever that node of the graph is executed in any engine, the workunit (the shared object that calls the graphs to execute) will pass it to the engine, which will use your methods to build the dataset.
The Test
Now that we have the basic functionality working, we need to make sure we can handle all the new cases we're considering (and hopefully make sure we fail where we should fail). To do that, add a test case with the syntax variations you expect to work, and one with the things you expect to fail (note that the pull request mentioned doesn't have this!).
If all of them compile, you're on the right track. But you have to inspect the generated C++ code to make sure it's doing what you expect it to. Since we are only re-using an activity, the activity you generate has to be similar to other implementations of the same activity. To check that, look at the other files that implement 'CHThorInlineTableActivity' and compare.
You need to be careful with the exact code. We want 'numRows' to reflect the uncertainty passed via ECL (on stored variables, for example). We want 'getRow' to update the counter every time it returns a row, and to do so at the right time. See:
Git commit: Dataset count range starts at 1
https://github.com/hpcc-systems/HPCC-Platform/pull/2165/files
The pull request above fixes a bug where the original code assumed the rows started at zero, when in ECL they actually start at one.
If you are lucky enough not to have to add activities with your changes, you can also add tests to the regression suite[2] to make sure the output of your new class is in sync with what you expected in the first place, possibly using the same (or a similar) test as you used in the compiler regression.
Once you're happy with the output, no other test fails and the failures on your new construct are being caught by your negative test, you're ready to submit this pull request.
References
[1] The compiler regression suite is a diff-based comparison: you run it with a clean top-of-tree version (of the branch you're targeting), run it again with your changes, and diff the results (logs, XML of intermediate code and resulting C++ files).
A tutorial on how to run the regressions and how to interpret the differences (which can be daunting, sometimes) is in the process of being created. In the meantime, please refer to the 'ecl/regress' directory and the regression scripts 'regress.sh' or 'regress.bat' in it for more information.
[2] The regression suite is a set of tests that compile and execute code on all three engines (Thor, Roxie and HThor). Please refer to the 'testing/ecl' directory for more information.
A tutorial on how to run the regression suite (not the same as the compiler regressions above) should take a while, since the underlying technology is changing. Ask on the mailing list for more info.
Follow-up
The next step in the tutorial is Step 2: The Distributed Flag, and Execution Tests.
Posted about 13 years ago by rengolin
This tutorial will walk you through adding a new feature to the compiler, making sure it executes correctly in the engines, and performing some basic optimisations such as replacing and inlining expressions.
When adding features to the compiler, there are two main places where you have to add code: the compiler itself, including the parser, the expression builder and exporter, and the engines (Roxie, Thor and HThor), including the common graph node representation.
You need to make sure all possible variations of your new construct will work, not only by itself, but in conjunction with other features of ECL, by creating exhaustive tests on both compiler and regression suites.
Finally, we'll see how to add flags, optimise another query into your optimal new construct and allow them to be exported inlined.
The aim of this text is to eventually appear as a PDF document guiding people changing the ECL compiler, but I have decided to post it in full on the blog, as a request for comments as well as to provide early access to it.
The Feature
This walk-through is based on the implementation of:
DATASET(count, TRANSFORM(..., COUNTER, ...))
This DATASET syntax will execute the TRANSFORM 'count' times, passing the counter to it as a parameter wherever numerical fields are expected, to build incremental datasets. This feature is useful for creating test tables, where the data is used to test other features, or accessory tables which, when joined with other tables, can help you organise them.
There was already another syntax used to achieve the same functionality, when the dataset had only one ROW:
NORMALIZE(dataset, count, TRANSFORM(..., COUNTER, ...))
That syntax does not make its intention clear and sometimes required the creation of a dummy dataset, which made code less readable. We also wanted to make the operation distributed across the nodes, and doing that on a syntax that is already known and complex (like NORMALIZE, with its many other uses) was harder than doing it on a new one.
So, it was clearer (and easier) to add a new, simple (and meaningful) syntax, get NORMALIZE to optimise to it under certain conditions, and distribute the DATASET.
We'll follow the commits in Github as a real-world annotated walk-through on how to implement new features in the compiler, new activities in the engine and provide a way to test it. It might not be the optimum path, but it is a real one and will help you understand the kind of problems we try to solve and how we do it in the wild.
Each step will be referenced by its pull request in GitHub, so you can refer to them as a complement to this tutorial.
The Files
All compiler files are within the directory 'ecl/hql' in the source tree, including the parser, tree builder, optimisers and exporters. You'll add your new feature on those files, and you'll need some tests under 'ecl/regress' to make sure the compilation part of the process is sane.
We use bison to generate our parser from Yacc files. The main file holding the whole grammar is 'hqlgram.y'. This file contains all definitions, reserved keywords and general structure of the language. 'hqlexpr.cpp' is the core of the tree builder, while 'hqlopt.cpp' and 'hqlfold.cpp' are the main optimisers, the former for general optimisations and the latter mostly for folding expressions.
Roxie's activity files are under 'roxie/ccd', Thor's under 'thorlcr/activities' and HThor's under 'ecl/hthor'. Those files need to be changed if you're adding not only a new syntax (ie. a different way of performing the same activity), but also a new activity, or at least changing the way an activity is executed.
The contents of this tutorial are expanded into the next four posts:
Step 1: The Parser, The Expression Tree and the Activity.
Step 2: The Distributed Flag, and Execution Tests.
Step 3: The Optimisation, and More Tests.
Step 4: Inlining and Conclusion.
Posted over 13 years ago by flavio
As I was preparing the Keynote that I delivered at World-Comp'12, about Machine Learning on the HPCC Systems platform, it occurred to me that it was important to remark that when dealing with big data and machine learning, most of the time and effort is usually spent on the data ETL (Extraction, Transformation and Loading) and feature extraction process, and not on the specific learning algorithm applied. The main reason is that while, for example, selecting a particular classifier over another could raise your F score by a few percentage points, not selecting the correct features, or failing to cleanse and normalize the data properly, can decrease the overall effectiveness and increase the learning error dramatically.
This process can be especially challenging when the data used to train the model, in the case of supervised learning, or that needs to be subject to the clustering algorithm, in the case of, for example, a segmentation problem, is large. Profiling, parsing, cleansing, normalizing, standardizing and extracting features from large datasets can be extremely time consuming without the right tools. To make things worse, it can be very inefficient to move data during the process, just because the ETL portion is performed on a system different to the one executing the machine learning algorithms.
While all these operations can be parallelized across entire datasets to reduce the execution time, there don't seem to be many cohesive options available to the open source community. Most (or all) open source solutions tend to focus on one aspect of the process, and there are entire segments of it, such as data profiling, where there seem to be no options at all.
Fortunately, the HPCC Systems platform includes all these capabilities, together with a comprehensive data workflow management system. Dirty data ingested on Thor can be profiled, parsed, cleansed, normalized and standardized in place, using either ECL, or some of the higher level tools available, such as SALT (see this earlier post) and Pentaho Kettle (see this page). And the same tools provide for distributed feature extraction and several distributed machine learning algorithms, making the HPCC Systems platform the open source one stop shop for all your big data analytics needs.
If you want to know more, head over to our HPCC Systems Machine Learning page and take a look for yourself.
Flavio Villanustre
Posted over 13 years ago by rengolin
HPCC's distributed file system has the concept of SuperFiles: collections of files with the same format, used to aggregate data and automate disk reads.
The operations you can perform on a SuperFile are the usual ones for every file (create, remove, rename) and every collection (add/remove children, etc.). With that, the concept of transactions becomes very important. If you're adding three subfiles and the last one fails, you want to clean up the first two. More importantly, if you are deleting files and one fails, you want them back, so the user can try again, maybe with a different strategy. All in all, what you don't want is to lose data.
How files are handled
In the DFS, files are tree nodes (much like inodes) with certain properties, stored in a Dali server (our central file server). The actual data is spread over Dali slaves (FileParts) to maximise IO efficiency. However, Dali controls much more than just files; it controls information and how we access it. Multiple queries can simultaneously access a file to read, but once a file has to be written, all other queries must stop and wait. And because most file usage is temporary, the requirement is that write-locks can only be given if there is no other read-lock on that file.
While this is true for Thor (our data-crunching engine), it's not for Roxie (our fast-query engine), so some queries can fail in Roxie that would otherwise work in Thor.
Also, when dealing with multiple files at the same time, you end up locking them all, stopping you from ever getting a write-lock. If you're read-locking the same C++ object several times, then you can change it to a write lock, but if two different objects (on the same thread, on the same logical operation) have a read-lock, you're stuck.
Making sure you have the same objects when dealing with the same files on the same concept is no easy task, so problems like these were dealt with by changing the properties directly when you were sure you could. That led to bloat in the code (multiple repetitions of slightly similar code) and multiple types of locks (lockProperties, lockTransaction) that would do the same thing, only differently (if that makes any sense).
Transactions
If you back-track and analyse what a transaction is, you can see that it solves most of the problems above. First, a transaction is an atomic operation, where either all or nothing happens. That was already guaranteed by the current transactional model (albeit with some bloating). But a transaction has to be protected from the outside world and vice-versa. If you create a file within a transaction, the file must only exist in your transaction. If you delete a file, it can only be physically deleted if the transaction is successful. So, creation and deletion of files must also be done within transactions.
Transactions also provide us with a very clear definition of a process: a process is whatever happens within a transaction. This is very simple, but very powerful, because now we can safely say that all objects referring to the same file in a transaction *must* be the same, AND all objects referring to the same file on *different* transactions must be different. And, since transactions already have their own file cache, that's the cure for rogue read-locks preventing write-locks.
Current Work
The work that has been done over the last few months has cleaned up a lot of the bloat and duplication, and has migrated more file actions into transactions. It has also removed the handling of properties directly (rather than through the file API), and so on. But there's still a lot to do to get to a base where transactions can become first-class citizens.
The short-term goal is to make every file action happen as part of a transaction, but we can't force all other parts of the system to use transactions, so we had to add some temporary local transactions to the DFS functions to cover for that. We could have changed the rest to use transactions, but since not all actions are performed within transactions, that would lead us to even more confusion.
So, until we have all file actions within transactions, each action will have its own local transaction created, if none was provided. Actions will be created, executed and automatically committed, as if they were part of a normal transaction. That adds a bit of bloat of its own, but once all actions are done, the code for each will be very simple:
...Some validation code...
Action *action = createSomeAction(transaction, parameters);
transaction->autoCommit();
The auto-commit will only commit if the transaction is inactive. Otherwise, we'll wait until the user calls for "commit" in her code. Simple as that.
Long-term goals
The long-term goal is to simplify file access to a point where the API looks as clean as:
{
    DFSAccess dfs; // rollback on destructor
    dfs.setUser(user);
    dfs.setCurrentDir("mydir");
    dfs.createSuperFile("a");
    dfs.addSubFile("a", "x");
    dfs.addSubFile("a", "y");
    dfs.addSubFile("a", "z");
    dfs.removeFile("k");
    dfs.commit();
}
or even simpler:
if (!AutoCommitDFSAccess(user).
        createSuperFile("a").
        addSubFile("a","x")) { // commits on destructor
    throw(something);
}
DFSAccess is an object that knows who you are, where you are and what you're doing. It uses the "user" object to access restricted objects, and it has an intrinsic transaction that starts in the constructor and rolls back in the destructor, unless you commit first. Of course, you can start and stop several transactions within the lifetime of the object, and even keep it safe as a member of another class, or as a global pointer.
It doesn't matter how you use it; it should do the hard work for you if you're lazy (or just want quick access), and provide you with complete control over the file-system access if you desire. That means hiding *all* property manipulation and making sure the right logic goes into the right place, without the need to refactor the whole platform, since everyone else will be using the DFSAccess API.
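As an aside, the commit-or-rollback behaviour described above is the classic RAII pattern. The following is a small self-contained C++ sketch of that pattern only; the class and method names are invented for illustration and this is not the planned DFSAccess implementation.
#include <iostream>

// Illustrative only: roll back in the destructor unless commit() was called,
// which is the behaviour an object like DFSAccess could be built around.
class ScopedTransaction
{
public:
    ScopedTransaction() { std::cout << "begin transaction\n"; }
    ~ScopedTransaction()
    {
        if (!committed)
            std::cout << "rollback (scope left without commit)\n";
    }
    void commit()
    {
        std::cout << "commit\n";
        committed = true;
    }
private:
    bool committed = false;
};

int main()
{
    {
        ScopedTransaction txn;
        // ... add subfiles, remove files, etc. ...
        txn.commit();          // all actions become visible atomically
    }
    {
        ScopedTransaction txn;
        // ... an error path: we leave the scope without committing ...
    }                          // destructor rolls everything back
    return 0;
}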
The main ideas to provide a simple, clean API that makes it clear what you're doing are:
Simplify DFS calls, ie. move file-system code up the API,
Protect file properties and Dali locks from non-DFS code,
Remove the concept of transactions from the user, unless they really want it,
Enforce the use of transactions on *every* DFS action, even if the user doesn't need it,
Provide some control over transactions if the user *really* needs it.
URI naming system
Another long-term goal (that is becoming shorter and shorter) is to allow URI-based file name resolution. There are far too many file naming styles in HPCC, and most of them can be used interchangeably. For instance, "regress::mydir::myfile" is a Dali path, but if you're resolving files locally (ie. no Dali connected), it transforms itself into a local file. This is a powerful feature, I agree, but what happens if I was expecting to get a Dali file, and there is a local file that is older than the one in Dali (with the wrong contents)? The program will not fail, as it should. Debugging problems like these takes time and produces a lot of grey hair.
The idea is to use URIs to name files. So, if you use a generic name like "hpcc:///regress/mydir/myfile", it can mean anything, including local files. If you specify that it needs to be a Dali file, like "hpcc://mydali/regress/mydir/myfile", then it'll fail if Dali is not connected. Local files can be named as "file:///var/local/foo/file", pretty much the same way other URIs work. Web files can also be opened (read-only) using normal URLs.
We're also thinking of adding more complex logic to the resolution of files. For instance, as of today, HPCC (master) has the ability to deal with Git files (files in a local Git repository) and archived files (zip, tar, etc). So, we can treat the URI as telling us what type each file is, if the extension is not obvious enough. Both "hpcc://mydali/mydir/myfile.zip" and "file:///home/user/code/.git/dir/whatever/file" can automatically be recognized as Zip and Git files, but so would "hpcc://mydali/mydir/myfile?format=zip", for instance.
In order to do that, the file resolution has to be united under another API, with logic to orthogonally resolve the different protocols (hpcc, http, file, etc) and file types (ecl, zip, git, xml, csv, etc). But that is enough material for a separate post in itself.
Posted over 13 years ago by rtaylor
How does one get started writing a Blog?
Flavio suggested to me recently, "You should write a blog about ECL to give the community an additional resource to learn more about it." So I said, "OK, I know quite a bit about ECL, so what specifically are you suggesting I write about?" And he replied, "Machine Learning."
Well, writing about Machine Learning would be great, if I actually knew anything about it, but I don't. So here's what I propose to do: post a series of articles chronicling my "adventures" in learning to do Machine Learning in ECL.
First off, I needed a resource to learn from, so I bought my first book on Machine Learning:
"Machine Learning for Hackers" by Drew Conway and John Myles White (O’Reilly)
Copyright 2012 Drew Conway and John Myles White
ISBN: 1449303714
This excellent (and highly recommended) book presents the topic from a programmer's perspective that I can understand -- explaining what Machine Learning is used for and why, by way of a ton of example code that shows you exactly how to get it all done. My only "problem" is that all the code examples in the book are in a language called R that I am completely unfamiliar with.
The authors said that R was developed by statisticians for statisticians, and I am definitely NOT a statistician, so my plan is to translate their R example code into ECL and run it against their example data to see if I can duplicate their result. Then I'll discuss in each article how the ECL version accomplishes the same task as the authors' R code, except it will be running on our massively parallel processing platform (HPCC) instead of a single Windows box (as it will do when I run their R code to see what it does).
The ultimate purpose of this blog is to teach ECL in a practical manner to everybody, so I will reference the R example code by chapter and page number but will not be quoting it from the book. To anyone already familiar with R, these articles should help you to learn ECL -- I suggest you buy a copy of the book to see exactly the R code I'm translating. To everybody else -- I suggest you buy a copy of the book to read the discussion of the Machine Learning techniques the code demonstrates. Either way, it is well worth the investment.
Chapter One
This chapter revolves mostly around introducing R to anyone who is unfamiliar with it, so there are no specific Machine Learning techniques used. However, it does provide an interesting introduction to the R-to-ECL translation process. The authors have structured the book so that their introduction of R also introduces a number of standard database techniques (such as data standardization, cleansing and exploration) which is one of the major reasons I find their approach so intuitive, since my programming background brings me to Machine Learning from a database application development direction.
The example data used in this chapter is UFO sighting data that the authors acquired from the Infochimps.com website. I'm using the file that came with the R code download that I got from the OREILLY.COM website. It is a CSV-type file using the tab character ("\t") as the field delimiter.
Spraying the Data
The first bit of example code on page 14 simply reads the data file and displays the records. In ECL, before you can use a file you must first spray it onto your cluster then define the file. I sprayed the UFO data file to my cluster using the Spray CSV page in ECL Watch, specifying the \t Separator and giving it the name '~ml::ufodata' then defined the file like this:
EXPORT File_UFO := MODULE
EXPORT Layout := RECORD
STRING DateOccurred;
STRING DateReported;
STRING Location;
STRING ShortDescription;
STRING Duration;
STRING LongDescription{MAXLENGTH(25000)};
END;
EXPORT File := DATASET('~ml::ufodata',Layout,CSV(SEPARATOR('\t')));
END;
I put this code in my "ML_Blog" directory and named the code file "File_UFO" (the file name of the .ECL file storing the code must always be the same as the name of the EXPORT definition contained in that file, in this case "File_UFO") so I can easily reference it later. I used the MODULE structure to organize the RECORD structure and the DATASET definitions in the same code file.
Each field is defined as a variable-length STRING to begin with, because we don't yet know how much data might be in any given field. Before spraying, I opened the file in a text editor to look at the data. I scrolled through the file to see how long the longest record was, which is how I determined the MAXLENGTH to apply to the final field (I could see that the rest of the fields were fairly short). I need to have the MAXLENGTH explicitly defined to override the 4K default that would have applied if I had not. For the field names, I simply duplicated what the authors used in the second bit of example code at the bottom of page 14.
Next, I varied a bit from the script the authors proposed and did a bit of standard ECL-style data exploration to determine exactly what the size of each field should be. I opened a new builder window and ran this code:
IMPORT ML_Blog;
ML_Blog.File_UFO.File;
COUNT(ML_Blog.File_UFO.File); //61393
MAX(ML_Blog.File_UFO.File,LENGTH(TRIM(DateOccurred))); //8
MAX(ML_Blog.File_UFO.File,LENGTH(TRIM(DateReported))); //8
COUNT(ML_Blog.File_UFO.File( LENGTH(DateOccurred) <> 8 OR
LENGTH(DateReported) <> 8)); //254
OUTPUT(ML_Blog.File_UFO.File( LENGTH(DateOccurred) <> 8 OR
LENGTH(DateReported) <> 8),ALL);
MAX(ML_Blog.File_UFO.File,LENGTH(TRIM(Location))); //70
MAX(ML_Blog.File_UFO.File,LENGTH(TRIM(ShortDescription))); //954
MAX(ML_Blog.File_UFO.File,LENGTH(TRIM(Duration))); //31
MAX(ML_Blog.File_UFO.File,LENGTH(TRIM(LongDescription))); //24679
The IMPORT ML_Blog makes all my EXPORTed definitions in the ML_Blog directory available for use. The next line shows me the first 100 records in the file, then the COUNT tells me the total number of records (61,393 in the file I downloaded).
I wrote the two MAX functions followed by COUNT and OUTPUT because the authors alerted me to a possible data corruption error, based on the error message they got when they first tried to format the DateOccurred field into a standard R date field (page 15). The MAX functions both returned 8, so the file I downloaded must have been updated since the authors worked with it. However, the COUNT function tells me there are 254 records that do not contain a standard YYYYMMDD date string, so the OUTPUT shows me those records.
The ALL option on my OUTPUT allows me to see all the records (not just the first 100), so I can see that the data corruption is the presence of "0000" as the DateOccurred in 254 records. Since the number of "bad" records I found does not match the number the authors cited, I can only assume that the file download I got is a newer/better file than the authors worked with.
The rest of the MAX functions tell me how much data is actually contained in each field. Using the information gained, I can now edit the file definition to this:
EXPORT File_UFO := MODULE
EXPORT Layout := RECORD
STRING8 DateOccurred;
STRING8 DateReported;
STRING70 Location;
STRING ShortDescription{MAXLENGTH(1000)};
STRING31 Duration;
STRING LongDescription{MAXLENGTH(25000)};
END;
EXPORT File := DATASET('~ml::ufodata',Layout,CSV(SEPARATOR('\t')))(DateOccurred <> '0000');
END;
I left the two description fields as variable-length because that will allow the system to make the most efficient use of storage. There is no problem with mixing fixed and variable-length fields in any of the file formats (flat-file, CSV, or XML). I also added a filter to the DATASET definition to eliminate the records with bad dates.
Date Conversion and Filtering Bad Records
This next section deals with data cleansing and standardization. These are the kind of standard data operations that always need to be done in any/every data shop. This is one of the first steps taken in any operational database to ensure that you're not working with "garbage" data, and that the format of your data is the same in each record.
Since the authors are simply filtering out the corrupt records (data cleansing), I can do the same.
To re-format the date strings into the ISO basic format (%Y%m%d) that the authors use on page 15 (data standardization), I need to create a new recordset with the fields converted to an explicit date field format. For that, I need to use ECL's Standard Date Library functions, like this:
IMPORT $, STD;
ds := $.File_UFO.File;
Layout := RECORD
STD.Date.Date_t DateOccurred := STD.Date.FromString(ds.DateOccurred,'%Y%m%d');
STD.Date.Date_t DateReported := STD.Date.FromString(ds.DateReported,'%Y%m%d');
ds.Location;
ds.ShortDescription;
ds.Duration;
ds.LongDescription;
END;
EXPORT CleanedUFO := TABLE(ds,Layout);
This code is stored in a file named "CleanedUFO" (again, the file name of the .ECL file storing the code must always be the same as the name of the EXPORT definition contained in that file).
The re-definition of "$.File_UFO.File" to "ds" is done simply to make the rest of the code a little easier to read. The STD.Date.Date_t data type used here is just a re-definition of UNSIGNED4 to reflect that the contents of the binary field will be an integer comprised of the numeric value represented by a YYYYMMDD date. The STD.Date.FromString() function converts a YYYYMMDD date string to the UNSIGNED4 binary value, reducing the storage requirement by half, so that a "20120101" date becomes the integer value 20,120,101.
I ran a quick test of this code, but did nothing more with it because I can pretty easily combine this with the next two steps and get all the "work" done in one job.
Parsing the Location Field
The authors want to split the data in the Location field into separate City and State fields (data standardization again). There are a number of ways in ECL to accomplish that, but I chose to write a simple FUNCTION structure to handle that job and add it into the CleanedUFO code from above, like this:
IMPORT $, STD;
ds := $.File_UFO.File;
SplitLocation(STRING Loc) := FUNCTION
LocLen := LENGTH(Loc)+1;
CommaPos := LocLen - STD.Str.Find(STD.Str.Reverse(Loc),',',1);
RetLoc := MODULE
EXPORT STRING70 City := IF(CommaPos=LocLen,Loc,Loc[1..CommaPos-1]);
EXPORT STRING10 State := IF(CommaPos=LocLen,'',TRIM(Loc[CommaPos+1..],LEFT));
END;
RETURN RetLoc;
END;
Layout := RECORD
STD.Date.Date_t DateOccurred := STD.Date.FromString(ds.DateOccurred,'%Y%m%d');
STD.Date.Date_t DateReported := STD.Date.FromString(ds.DateReported,'%Y%m%d');
ds.Location;
ds.ShortDescription;
ds.Duration;
ds.LongDescription;
STRING70 City := SplitLocation(ds.Location).City;
STRING10 State := SplitLocation(ds.Location).State;
END;
EXPORT CleanedUFO := TABLE(ds,Layout);
In looking at the data, I noticed that the State value always comes last and is delimited from the City value by a comma. I also noticed that some Location field values had multiple commas, so I wrote my SplitLocation() function to specifically find the last comma and split the text at that point. I could have used the STD.Str.SplitWords() function from the ECL Standard Library to accomplish this part, but I decided that writing my own FUNCTION would provide a better teaching example for this Blog.
To find the last comma, my SplitLocation() function determines the length of the passed string and adds 1 to that. I'm adding 1 because I need to get the inverse value to determine the actual position of the last comma in the string. I'm reversing the string text using STD.Str.Reverse(), then using the STD.Str.Find() function to find the position of the first comma in the reversed string. Subtract that position from the length + 1 and voila: there's the actual position of the last comma.
The next "trick" I'm using here is making the FUNCTION RETURN a MODULE structure, to allow it to return multiple values. Usually, functions return only a single value, but this "trick" makes it possible to have as many return values as you need (in this case, two: City or State). Then within the MODULE structure I'm using the IF() function to determine the actual return values. In both cases, my IF condition is CommaPos=LocLen, which, if true, indicates that there was no comma found in the string. If no comma was found, then the City field gets the input returned and the State returns blank.
I'm then using my SplitLocation() function to populate two additional fields in my result recordset, just as the authors have done in their R code.
Limiting Data to US States
The last bit of cleaning and standardization is in the example code on page 18, where the authors define the set of valid US states and then filter the records so that only those records with valid state field values are included in the final result. Here's how I did that, once again simply expanding on the CleanedUFO TABLE definition, like this:
IMPORT $, STD;
ds := $.File_UFO.File;
SplitLocation(STRING Loc) := FUNCTION
LocLen := LENGTH(Loc)+1;
CommaPos := LocLen - STD.Str.Find(STD.Str.Reverse(Loc),',',1);
RetLoc := MODULE
EXPORT STRING70 City := IF(CommaPos=LocLen,Loc,Loc[1..CommaPos-1]);
EXPORT STRING10 State := IF(CommaPos=LocLen,'',TRIM(Loc[CommaPos+1..],LEFT));
END;
RETURN RetLoc;
END;
Layout := RECORD
STD.Date.Date_t DateOccurred := STD.Date.FromString(ds.DateOccurred,'%Y%m%d');
STD.Date.Date_t DateReported := STD.Date.FromString(ds.DateReported,'%Y%m%d');
ds.Location;
ds.ShortDescription;
ds.Duration;
ds.LongDescription;
STRING70 City := SplitLocation(ds.Location).City;
STRING10 State := SplitLocation(ds.Location).State;
END;
USstates := ['AK','AL','AR','AZ','CA','CO','CT','DE','FL','GA','HI','IA','ID','IL',
'IN','KS','KY','LA','MA','MD','ME','MI','MN','MO','MS','MT','NC','ND',
'NE','NH','NJ','NM','NV','NY','OH','OK','OR','PA','RI','SC','SD','TN',
'TX','UT','VA','VT','WA','WI','WV','WY'];
EXPORT CleanedUFO := TABLE(ds,Layout)(State IN USstates) : PERSIST('PERSIST::CleandUFOdata');
I just added the USstates SET definition, then appended a filter to the end of the TABLE function to always limit the CleanedUFO recordset to the valid records. The addition of the PERSIST Workflow Service on the TABLE definition simply ensures that the work happens only the first time we use the CleanedUFO table.
So to see the result, I ran this code in a separate builder window:
IMPORT ML_Blog;
ML_Blog.CleanedUFO;
The result looks very similar to the author's result, shown on page 18.
Final Thoughts
That's enough for this article. We'll continue with the rest of this chapter in the next post. That's when we'll put this data to use, by exploring what we have, doing some data analysis, and creating visual representations of the results.
Posted over 13 years ago by flavio
More than 12 years ago, back in 2000, LexisNexis was pushing the envelope on what could be done to process and analyze large amounts of data with commercially available solutions at the time. The overall data size, combined with the large number of records and the complexity of the processing required, made existing solutions non-viable. As a result, LexisNexis invented, from the ground up, a data-intensive supercomputer based on a parallel share-nothing architecture running on commodity hardware, which ultimately became the HPCC Systems platform.
To put this in a time perspective, it wasn't until 2004 (several years later) that a pair of researchers from Google published a paper on the MapReduce processing model, which fueled Hadoop a few years later.
The HPCC Systems platform was originally designed, tested and refined to specifically address big data problems. It can perform complex processing of billions (or even trillions) of records, allowing users to run analytics in their entire data repository, without resorting to sampling and/or aggregates. Its real-time data delivery and analytics engine (Roxie) can handle thousands of simultaneous transactions, even on complex analytical models.
As part of the original design, the HPCC Systems platform can handle disparate data sources, with changing data formats, incomplete content, fuzzy matching and linking, etc., which are paramount to LexisNexis proprietary flagship linking technology known as LexID(sm).
But it is thanks to ECL, the high-level data-oriented declarative programming language powering the HPCC Systems platform, that this technology is truly unique. It is because of advanced concepts such as data and code encapsulation, lazy evaluation, prevention of side effects, implicit parallelism, and code reuse and extensibility, that data scientists can focus on what needs to be done, rather than on superfluous details around the specific implementation. These characteristics make the HPCC Systems platform significantly more efficient than anything else available in the marketplace.
Last June, almost a year ago, LexisNexis decided to release its supercomputing platform, under the HPCC Systems name, giving enterprises the benefit of an open source data intensive supercomputer that can solve large and complex data challenges. One year later, HPCC Systems has made a name for itself and built an impressive Community. Moreover, the HPCC Systems platform has been named one of the top five "start-ups" to watch and has been included in a recent Gartner 2012 Cool IT Vendors report.
LexisNexis has made an impact in the marketplace with its strategic decision to open source the HPCC Systems platform: a bold and innovative decision that can only arise from a Company which prides itself on being a thought leader when it comes to Technology and Big Data analytics.
Posted over 13 years ago by flavio
One of our community members recently asked about fraud detection using the HPCC Systems platform. The case that this person described involved identifying potentially fraudulent traders, who were performing a significant number of transactions over a relatively short time period. As I was responding to this post in our forums, and trying to keep the answer concise enough to fit in the forums format, I thought that it would be useful to have a slightly more extensive post, around ideas and concepts when designing an anomaly detection system on the HPCC Systems platform.
For this purpose I'll assume that, while it's possibly viable to come up with a certain number of rules to define what normal activity looks like (even though the number of rules could be large), it's probably unfeasible to come up with rules that would describe every potential anomalous behavior (fraudsters can be very creative!). I will also assume that while, in certain cases, individual transactions could be flagged as anomalous due to characteristics in the particular data record, in the most common case it is through aggregates and statistical deviations that an anomaly can be identified.
The first thing to define is the number of significant dimensions (or features) the data has. If there is one dimension (or very few dimensions) where most of the significant variability occurs, it is conceivable to manually define rules that, for example, would mark transactions beyond 3 or 4 sigma (standard deviations from the mean for the particular dimension) as suspicious. Unfortunately, things are not always so simple.
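As a toy illustration of that kind of sigma rule (a self-contained sketch in plain C++ rather than ECL, with made-up numbers and names; real transaction data would of course be aggregated on the cluster), the baseline statistics can be estimated from known-normal activity and new observations flagged against them:
#include <cmath>
#include <iostream>
#include <vector>

// A single-dimension version of the "k sigma" rule: fit mean and standard
// deviation on known-normal historical activity, then flag new observations
// that fall more than k standard deviations from that baseline.

struct Baseline { double mean; double sigma; };

static Baseline fitBaseline(const std::vector<double> & normalValues)
{
    double mean = 0.0;
    for (double v : normalValues) mean += v;
    mean /= normalValues.size();

    double variance = 0.0;
    for (double v : normalValues) variance += (v - mean) * (v - mean);
    return { mean, std::sqrt(variance / normalValues.size()) };
}

static bool isSuspicious(const Baseline & b, double value, double k = 3.0)
{
    return std::fabs(value - b.mean) > k * b.sigma;
}

int main()
{
    // e.g. transactions per hour for a trader during known-normal activity
    std::vector<double> normalActivity = {3, 4, 2, 5, 3, 4, 2, 3, 4, 5};
    Baseline b = fitBaseline(normalActivity);

    std::vector<double> newObservations = {4, 6, 250};
    for (double v : newObservations)
        std::cout << v << (isSuspicious(b, v) ? " -> suspicious\n" : " -> normal\n");
    // prints: 4 -> normal, 6 -> normal, 250 -> suspicious
    return 0;
}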
Generally, there are multiple dimensions, and identifying by hand those that are the most relevant can be tricky. In other cases, performing some type of time series correlation can identify suspicious cases (for example, in the case of logs for a web application, seeing that someone has logged in from two locations a thousand miles apart in a short time frame could be a useful indicator). Fortunately, there are certain machine learning methodologies that can come to the rescue.
One way to tackle this problem is to assume that we can use historical data on good cases to train a statistical model (remember that bad cases are scarce and too variable). This is known as a semi-supervised learning technique, where you train your model only on the "normal" activity and expect to detect anomalous cases that exhibit characteristics which are different from the "norm". One specific method that can be used for this purpose is called PCA (Principal Components Analysis), which can automatically reduce the number of dimensions to those that present the largest significance (there is a loss of information as a consequence of this reduction, but this tends to be minimal compared to the value of reducing the computational complexity). KDE (Kernel Density Estimation) is another semi-supervised method to identify outliers. On the HPCC Systems platform, PCA is supported through our ECL-ML machine learning module. KDE is currently available on HPCC through the ECL integration with Paperboat.
A possibly more interesting approach is to use a completely unsupervised learning methodology. Using a clustering technique such as agglomerative hierarchical clustering, supported in HPCC as part of the ECL-ML machine learning module, can help identify those events which don't cluster easily. Another clustering method available in ECL-ML, k-means, is less effective as it requires the number of centroids to be defined a priori, which could be very difficult. When using agglomerative hierarchical clustering, one of the aspects that could require some experimentation is identifying the number of iterations required for the best effectiveness: too many iterations and there will be no outliers, as all the data will be clustered; too few iterations and many normal cases could still be outside of the clusters.
Beyond these specific techniques, the best possible approach probably includes a combination of methods. If there are clear rules that can quickly identify suspicious cases, those could be used to validate or rule out results from the statistical algorithms; and since a strictly rules-based system would be ineffective at detecting every possible outlier, using some of the machine learning methodologies described above as well is highly recommended.
Flavio Villanustre
Posted over 13 years ago by flavio
You probably thought that the HPCC Systems platform and Hadoop were two technologies that represented the opposite ends of a spectrum, and that choosing one would make attempting to use the other unrealistic. If this is what you believed: think again (and keep reading).
The HPCC Systems platform has just released its Hadoop data integration connector. The HPCC/HDFS integration connector provides a way to seamlessly access data stored in your HDFS distributed filesystem from within the Thor component of HPCC. And, as an added bonus, it also allows you to write to HDFS from within Thor.
As you can see, this new feature enables several opportunities to leverage HPCC components from within your existing Hadoop cluster. One such application would be to plug the Roxie real-time distributed data analytics and delivery system, providing real time access to complex data queries and analytics, into data processed in your Hadoop cluster. It would also allow you to leverage the distributed machine learning and linear algebra libraries that the HPCC platform offers through its ECL-ML (ECL Machine Learning) module. And if you needed a highly efficient and highly reliable data workflow processing system, you could take advantage of the HPCC Systems platform and ECL, or even combine it with Pentaho Kettle/Spoon, to add a graphical interface to ETL and data integration.
So what does it take to use the HPCC/HDFS connector (or H2H, as we like to call it)? Not much! The H2H connector has been packaged to include all the necessary components, which are to be deployed to every HPCC node. HPCC can coexist with Hadoop, or run on a different set of nodes (which is normally recommended for performance reasons).
How did we do it? We leveraged the capabilities of ECL to pipe data in and out of a running workunit, through the ECLPipe command, and we created some clever ECL Macros (did I mention before that ECL Macros are awesome?) to provide for adequate data and function mappings from within an ECL program. Thanks to this, using H2H is transparent to the ECL software developer, and HDFS becomes just an option of a particular type of data repository.
What are the gotchas? Well, HDFS is not as efficient as the distributed filesystem used by HPCC, so these data reads and writes will not be any faster than HDFS allows (but they won't be noticeably slower either). Another caveat is that transparent access to compressed data (as it's normally provided by HPCC) is not available for data accessed from within HDFS (although decompression can be achieved easily in a following step, after the data is read).
I hope you are as excited as we are about this HPCC/Hadoop data integration initiative. Please take a look at the H2H section of our HPCC Systems portal for more information: http://hpccsystems.com/H2H, and don't hesitate to send us your feedback. This HPCC/HDFS connector is still in beta stage, but we expect to have a 1.0 release very soon.
Flavio Villanustre
Posted over 13 years ago by flavio
It is not uncommon to find situations where a classification model needs to be trained using a very large amount of historic data, but the ability to perform classification of new data in real time is required. There are many examples of this need, from real time sentiment analysis in tweets or news, to anomaly detection for fraud or fault identification. The common theme in all these cases is that the value of the real time data feeds has a steep decrease over time, and delayed decisions taken on this data are significantly less effective.
When faced with this challenge, traditional platforms tend to fall short of expectations. Those platforms that can deal with significant amounts of historical data and a very large number of features to create classification models (Hadoop is an example of such a platform) have no good option for real time classification using these models. This type of problem is quite common, for example, in text classification. In these cases, people usually need to resort to different tools, and even homegrown systems using Python and a myriad of other tools, to cope with this real time need.
The problem with these homegrown tools is that they need to meet all the concurrency and availability requirements that real time systems impose, as these online systems are usually critical to fulfill important internal or external roles for the business (the one anomaly that you just missed because your real time classifier didn't work properly could represent significant losses for the business).
What makes this even more challenging is the fact that, many times, it is desirable to retrieve and compare specific examples from the training set used to create the model, in real time too. And while developing a system that can classify data in real time using a pre-existing model may be quite doable, being able to also retrieve analogous or related cases would certainly require coupling the system with a database of sorts (just another moving part that adds complexity and cost to the system and potentially reduces its overall reliability).
But look no more, as the HPCC Systems platform may be just what you have been looking for all along: a consistent and homogeneous platform that provides for both functions, and a seamless workflow to move new and updated models, from the system where they are developed (Thor), to the real time classifier (Roxie).
At this point, it's probably worth explaining a little bit how Roxie works. Roxie is a distributed, highly concurrent and highly available, programmable data delivery system. Data queries (in a way equivalent to the stored procedures in your legacy RDBMS) are coded using ECL, which is the same high level data-oriented declarative programming language that powers Thor. Roxie is built for the most stringent high availability requirements, and the data and system redundancy factor is defined by the user at configuration time, with no single point of failure across the entire system. ECL code developed can be reused across both systems, Thor and Roxie.
A scenario like the one I described above can be easily implemented on the HPCC Systems platform, using one of the classifiers provided by the ECL-ML (ECL Machine Learning) modules on Thor and running it over your entire historical training set. To make this even more compelling, all the classifiers in ECL-ML have been designed with a common interface in mind, so plugging in a different classifier (for example, switching from a generative to a discriminative model) is as simple as changing a single line of ECL code. After a model (or several) is created, it can be tested on a test and/or verification set to validate it, and moved to Roxie for real time classification and matching. The entire training set can also be indexed and moved to Roxie, if real time retrieval of related records is required.
Powerful, simple, elegant, reliable. And every one of these components is available under an open source license, for you to play with.
For more information, head over to our HPCC Systems portal (http://hpccsystems.com).
Posted over 13 years ago by flavio
At HPCC Systems we have been very busy finding better ways to communicate with our Community. As a result of this, we have just released the first edition of our official HPCC Systems podcast, in which the Host and our Community Manager, Trish McCall, has a conversation with our senior trainer Bob Foreman around different aspects of the HPCC Systems platform, the ECL data-intensive programming language and some other topics that we hope you will find interesting.
In upcoming editions, we plan on having guests (Hint, hint! Let us know if you would like to be one of them!) covering new developments and the roadmap for HPCC Systems, discussions on specific capabilities around Machine Learning and Natural Language Processing, some coverage on SALT, our Scalable Automated Linking Technology, and much more.
For this first edition, Trish and Bob tried hard to keep the content under 30 minutes, which is just about perfect for a medium sized commute.
Don't waste a minute and head over to our podcasts page, or find it in iTunes. Please send us feedback and don't forget to rate it in iTunes, if you like it.
Flavio Villanustre