Getting Started with Carol

Carol is a data platform with Machine Learning. It is built around three capabilities:

  • Data platform: the capability to get data from anywhere and ensure data quality.
  • Intelligent apps: the capability to develop any application and deploy it on top of Carol Platform.
  • Carol Assistant: the capability to add the Carol Assistant to your application.

Carol has all the concepts and features defined by an MDM (Master Data Management) tool. Keep in mind that Carol normally handles more than one Data Source and centralizes all the data while ensuring that the data is of high quality.

Terms and Concepts

Carol relies on a set of concepts that you should understand clearly to get the most out of the platform.

The most recent version of the platform added new concepts, as follows:

  • Organization: This is a way to group multiple tenants (formerly called environments). In an Organization, the Organization Administrator can manage all tenants and resources related to the whole organization. The organization's name is the domain in a URL. For example, the organization name could be totvs.

  • Organization Admin: This is the role in an Organization responsible for maintaining tenants and all resources related to the Organization.

  • Organization User: This is a user who belongs to the organization, which means the user has access to one or more tenants inside the Organization.

  • Tenant: Formerly known as Environment. The tenant is responsible for storing the data models, connectors, data, named queries, and all resources provided by the Carol Platform. You will see more details in the following documentation. The tenant is the segment right after the base URL. For example, the tenant name could be rh.

  • Tenant Admin: This is the user responsible for managing the whole tenant, including data and resources.

  • Data Access Level: This is the way to restrict access to the data inside an organization. The Data Access Level can define dynamic rules and manage several users. The restriction applied is related to operations (delete, create, and filter data) and data (golden records).

The most important concepts introduced by Carol are described below.

Data Model Concepts

A Data Model is similar to a table in a relational database. The data model supports columns with specific data types (Integer, String, Boolean, Date, etc.) and stores all data related to the Data Model. In addition, the data model supports survivorship rules, rejection rules, flag rules, and other features described below.

  • Data Model Type: Each Data Model should be categorized into one of the five data model types.

    • Organization: data model related to companies. For instance: company, carriers, transportation companies, etc.
    • Person: data model related to a person. For instance: student, employee, etc.
    • Product: data model that describes products. For instance: products, services, etc.
    • Transaction: data model related to transactional data, normally related to a specific date/time. For instance: receipts, tickets, events, orders, etc.
    • Location: data model related to location.

  • Golden Record: A Golden Record is the final result of combining data from a set of connectors. Each connector contributes a set of Staging Records, and the staging records are mapped to a specific Data Model. The records stored in the data model (after all Data Model validation) are called Golden Records.

  • Merge Rules: Merge rules play a role similar to a Primary Key in relational databases: they identify records that represent the same entity. If an existing record matches the conditions specified in the merge rules, the records are merged following the "Survivorship Rules". Merge rules also allow the user to create a rule that finds records that are potentially the same (potential merges).

  • Rejection Rules: To help ensure data quality, the data model supports a set of rejection rules. Rejection rules specify the conditions a record must satisfy; records that do not match are rejected. The user can explore all rejected records to fix the data that does not meet the quality rules. A rejected record is not available on the platform as a Golden Record until it is fixed and passes the Rejection Rules.

  • Flag Rules: Flag Rules identify records that have a quality problem but are still good enough to be considered a Golden Record. Carol processes the record and makes it clear that something in the record could be improved.

  • Skip Rules: Skip Rules define whether a value on the record is good enough to be set on the Golden Record. In some cases, the record is not rejected by the Rejection Rules, but you want to add a specific rule that discards a value when it matches certain conditions.

  • Survivorship Rules: Survivorship rules tell Carol how to build the Golden Record when two golden records match the same merge rule. Some of the available rules are:

    • Recency: keeps the newest value.
    • Oldest: keeps the oldest value.
    • Frequently: keeps the most frequent value.
    • Set the survivor value from the data coming from a specific connector. This is useful when you trust a connector for a specific kind of data (like an address).

  • Relationship: This is the way to connect two different Data Models and create a relation between them. An example is Customer and Order.

  • Geocoding: This is another layer of data quality for address fields. If geocoding is enabled for your tenant and fields, Carol normalizes the address by calling external services and, as a consequence, stores the geocoding result, allowing your application to work with longitude and latitude values.
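
The way rejection, skip, merge, survivorship, and flag rules interact can be sketched in a few lines of plain Python. This is only an illustration of the concepts above, not Carol's implementation or API: the field names (tax_id, name, phone, updated_at), the SKIPPED_VALUES set, and every rule shown are assumptions made up for the example.

```python
from datetime import datetime

# All field names and rules below are hypothetical examples.
SKIPPED_VALUES = {"", "N/A"}


def is_rejected(record):
    # Rejection rule: a record without a tax_id cannot become a Golden Record.
    return not record["tax_id"].strip()


def survive_recency(group, field):
    # Survivorship "Recency": keep the newest value of `field`.
    # Skip rule: values in SKIPPED_VALUES never reach the Golden Record.
    usable = [r for r in group if r[field] not in SKIPPED_VALUES]
    if not usable:
        return ""
    return max(usable, key=lambda r: r["updated_at"])[field]


def build_golden_records(records):
    rejected, groups = [], {}
    for record in records:
        if is_rejected(record):
            rejected.append(record)
            continue
        # Merge rule: records that share a tax_id describe the same entity.
        groups.setdefault(record["tax_id"], []).append(record)

    golden = {}
    for tax_id, group in groups.items():
        merged = {
            "tax_id": tax_id,
            "name": survive_recency(group, "name"),
            "phone": survive_recency(group, "phone"),
        }
        # Flag rule: a missing phone is reported but does not block the record.
        merged["flags"] = [] if merged["phone"] else ["missing_phone"]
        golden[tax_id] = merged
    return golden, rejected


records = [
    {"tax_id": "123", "name": "Acme", "phone": "555-0100", "updated_at": datetime(2021, 1, 1)},
    {"tax_id": "123", "name": "Acme Corp", "phone": "N/A", "updated_at": datetime(2022, 1, 1)},
    {"tax_id": "456", "name": "Beta", "phone": "", "updated_at": datetime(2021, 6, 1)},
    {"tax_id": "", "name": "No Tax Id", "phone": "555-0199", "updated_at": datetime(2021, 3, 1)},
]
golden, rejected = build_golden_records(records)
print(golden["123"])
print(len(rejected))
```

In this sketch the two records with tax_id "123" merge into one golden record: the name comes from the newest record, while the phone falls back to the older record because "N/A" is skipped. The record without a tax_id ends up in the rejected list, and the record with tax_id "456" is kept but flagged.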

Connector Concepts

Connectors are the bridge between the external world and Carol. Carol provides a set of connectors to automate the process of capturing data.

Some sample connectors: TOTVS Products, Carol Connect, Dropbox, Facebook, Twitter, File, Salesforce, and RSS. Carol also accepts any kind of data through its REST API service.

Some concepts related to Connectors:

  • Staging Table: A table/container that has fields and is used to store data. Each field has a data type and follows a schema. Staging tables are created inside a connector and are used to store the data before it is transformed into Golden Records. If the connector gets data through REST APIs, the staging table is created through REST APIs as well.

  • Staging Table Schema: The definition of the Staging Table, that is, all the columns and the data type of each column. It is used to annotate and validate the data that is being sent.

  • Staging Record: A Staging Record is a record inside the staging table.

  • Identifier: Identifier is the primary key for the data in the staging table. If the connector gets a record with an existing identifier, it will replace the information in the staging record.

  • Staging Mapping: The mapping between the Staging Table and the Data Model. A data model can have any number of mappings. The cleansing rules are defined together with the mapping.

  • Cleansing Rules: A set of rules added to the mapping process to transform and clean the data. Functions such as trim, substring, lookup in a staging table or a data model, keep only digits, etc. are available. If no predefined function meets your needs, you can write your own logic through the JavaScript function.

  • Consumers: Connectors can work as contributors (the ones that provide data) or consumers (the ones that need to have the data on their side). Normally, a connector with consumption capability always needs the most up-to-date version of the data, and because of that, the connector consumes the Golden Record. Every time the Golden Record has a new version, Carol enqueues that new version for consumption.

  • Reverse Consumption Rules: The consumption process allows the connector to consume golden records following the golden record schema (defined in the data model). If the connector wants to consume the data in the same format in which it contributed the data (through staging tables), it can do so using the reverse consumption rules. These are a set of rules that convert the data back to the original format (defined on the staging table as the staging schema).
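
The staging concepts above (schema validation, identifier-based replacement, and a mapping with cleansing rules) can be illustrated with a small in-memory sketch. It does not use Carol's REST API or its real schema format; the StagingTable class, the CUSTOMER_SCHEMA and CUSTOMER_MAPPING structures, and all column names are hypothetical and exist only for this example.

```python
import re

# Hypothetical staging table schema: column name -> expected Python type.
CUSTOMER_SCHEMA = {"id": str, "full_name": str, "tax_id": str}


class StagingTable:
    """In-memory stand-in for a staging table keyed by an identifier column."""

    def __init__(self, schema, identifier):
        self.schema = schema
        self.identifier = identifier
        self.records = {}

    def send(self, record):
        # Validate the incoming record against the schema before storing it.
        for column, expected in self.schema.items():
            if not isinstance(record.get(column), expected):
                raise ValueError(f"column {column!r} must be {expected.__name__}")
        # A record with an existing identifier replaces the stored record.
        self.records[record[self.identifier]] = dict(record)


def keep_only_digits(value):
    # Cleansing function: drop every non-digit character.
    return re.sub(r"\D", "", value)


# Hypothetical staging mapping:
# data model field -> (staging column, cleansing functions applied in order).
CUSTOMER_MAPPING = {
    "name": ("full_name", [str.strip]),
    "taxid": ("tax_id", [str.strip, keep_only_digits]),
}


def apply_mapping(staging_record, mapping):
    mapped = {}
    for target, (source, cleansers) in mapping.items():
        value = staging_record[source]
        for cleanse in cleansers:
            value = cleanse(value)
        mapped[target] = value
    return mapped


table = StagingTable(CUSTOMER_SCHEMA, identifier="id")
table.send({"id": "c1", "full_name": "  Ana Souza ", "tax_id": "123.456.789-00"})
table.send({"id": "c1", "full_name": " Ana S. Souza ", "tax_id": "123.456.789-00"})
mapped = apply_mapping(table.records["c1"], CUSTOMER_MAPPING)
print(mapped)
```

Sending the second record with the same identifier replaces the first one, so the table still holds a single staging record. The mapping then trims the name and strips the punctuation from the tax ID before the values would reach the data model.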

ETL Concepts

ETL (Extract, Transform, Load) is a set of tools that allows the user to modify the data structure. Carol has a set of functions to cover the most common scenarios.

  • Split: This function allows the staging table to be split into two or more staging tables. It is useful when different concepts are stored in the same staging table. For instance, suppose your application keeps customers and providers in the same entity. You can use the ETL Split function to separate the customers and providers into two staging tables.

  • Duplicate: This function allows the staging table to be duplicated. It is useful when you need to work with the same data in different ways.

  • Join: This function allows two staging tables to be joined into a third, new staging table. For example, you can join the staging table "employee" with "city" to resolve the foreign key defined in "employee" to the matching city in the "city" table.

  • Lookup table: When working with the "Join" ETL function, the user can configure one or both staging tables as a lookup table. This means that Carol will not move the data out of this staging table, so it remains available for the next incoming record. If the staging table is not set as a lookup table, the staging record is joined and moved to the third staging table (the result staging table).
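
The Split and Join functions, including the lookup table behavior, can be pictured with the following sketch. It works on plain lists of dictionaries rather than real staging tables, and the function names, parameters, and sample data are assumptions chosen for illustration, not Carol's ETL API.

```python
def split(records, predicate):
    """Split one staging table into two, based on a predicate."""
    matching = [r for r in records if predicate(r)]
    rest = [r for r in records if not predicate(r)]
    return matching, rest


def join(records, lookup, record_key, lookup_key, prefix):
    """Join `records` with `lookup`, which is treated as a lookup table.

    Lookup rows are never consumed, so the same row can serve many records.
    Returns the joined rows and the records that found no match.
    """
    index = {row[lookup_key]: row for row in lookup}
    joined, unmatched = [], []
    for record in records:
        match = index.get(record[record_key])
        if match is None:
            unmatched.append(record)
            continue
        extra = {f"{prefix}{key}": value for key, value in match.items()}
        joined.append({**record, **extra})
    return joined, unmatched


# Split: customers and providers arrive in the same staging table.
partners = [
    {"id": 1, "kind": "customer"},
    {"id": 2, "kind": "provider"},
    {"id": 3, "kind": "customer"},
]
customers, providers = split(partners, lambda r: r["kind"] == "customer")

# Join: "employee" holds a foreign key to "city", which acts as a lookup table.
employees = [
    {"name": "Ana", "city": 10},
    {"name": "Bruno", "city": 10},
    {"name": "Carla", "city": 99},
]
cities = [{"id": 10, "name": "Recife"}, {"id": 20, "name": "Curitiba"}]
joined, unmatched = join(employees, cities, "city", "id", prefix="city_")
print(joined)
print(unmatched)
```

Because "city" behaves as a lookup table here, the same city row is reused for both Ana and Bruno. Carla has no matching city in the sample data, so her record stays in the unmatched list instead of reaching the result table.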

Explore Concepts

Explore is the module that allows the user to explore the data, apply specific filters, and see the 360-degree view of the Golden Record.

  • Master of Masters: Carol has the capability to share Data Models and Golden Records. Currently, Carol has a large table covering all companies in Brazil; this data can be accessed from all tenants using the Master of Masters concept.

  • Potential Merges: When potential merges are defined in the Merge Rules, Carol shows the number of potential merges to the user. The user has the power to merge and unmerge golden records.

Consumption Concepts

Carol allows users and applications to consume data through REST API services. The following concepts are important for understanding these services:

  • Filter: A query using Carol Query Language (similar to Elasticsearch Query DSL) to extract, aggregate, and compute data.

  • Named Query: A query written in the Carol Query Language and saved in Carol. It has the same features and power provided by "Filters" but keeps all the complexity of the filter inside Carol.
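
The difference between a Filter and a Named Query can be pictured as the difference between sending a full query each time and calling a stored query by name with parameters. The sketch below only illustrates that idea: the dictionary used as a filter is a made-up placeholder and is not Carol Query Language syntax, and save_named_query, resolve_named_query, and the "{{city}}" placeholder convention are hypothetical.

```python
import copy

# In-memory stand-in for the place where named queries are stored.
NAMED_QUERIES = {}


def save_named_query(name, filter_template):
    # Store the filter once; callers later refer to it only by name.
    NAMED_QUERIES[name] = copy.deepcopy(filter_template)


def resolve_named_query(name, params):
    """Return the stored filter with every "{{param}}" placeholder filled in."""

    def fill(node):
        if isinstance(node, dict):
            return {key: fill(value) for key, value in node.items()}
        if isinstance(node, list):
            return [fill(value) for value in node]
        if isinstance(node, str) and node.startswith("{{") and node.endswith("}}"):
            return params[node[2:-2]]
        return node

    return fill(NAMED_QUERIES[name])


# Placeholder filter structure, invented for this example.
save_named_query(
    "customersByCity",
    {"model": "customer", "where": {"field": "city", "equals": "{{city}}"}},
)
resolved = resolve_named_query("customersByCity", {"city": "Recife"})
print(resolved)
```

The caller only needs to know the query name and its parameters; the structure of the filter stays in one place and can change without touching the consuming application.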