Carol Data Storage: Entry Point

This chapter describes how Carol manages data in its storage layer, called CDS (Carol Data Storage).

Carol Data Storage is a low-cost way to store any kind of data, from records and documents to images and videos. Because the storage is inexpensive, it is a sensible starting point for building any solution in Carol.

Carol has two ways to store the data:

  • Real-time storage type.
  • CDS storage type.

The real-time storage type is optimized for speed: you can run any query on any dataset and get a fast response. Use this storage type when you need immediate access to the data.

The CDS storage type, as described before, is very cheap, but access can sometimes be a little slow. This storage type is meant for keeping large volumes of data, which can then be managed and sent to the real-time layer at the right granularity.

The following diagram shows the storage types in Carol:

[Image: storage types in Carol]

The Staging Area receives data in CDS. When the data is processed, the result is also stored in CDS, and the user can enable the real-time layer.

All data sent to Carol is stored in CDS, and the user decides the best way to store it: in CDS only, or in real-time storage as well.

Now, let's see in detail how each module in Carol handles the storage types.

Terms used

The following are important concepts used by Carol. It is worth knowing them, because they are referenced throughout the documentation and in the platform.

Carol Data Storage: This is the cheapest storage available in Carol. It can store any kind of data, and that data can then be processed in Carol.

Immutable data: The data is always preserved. When a change happens, both the old and the new version of the data are kept. For example, suppose we have the following record:

| code | name | address | creationDate |
|---|---|---|---|
| 21 | Robson Poffo | 620 Fayetteville Street | 2020-03-27T12:21:10Z |

If the address then changes, a new version of the record is persisted. In the end, we have the following records:

| code | name | address | creationDate |
|---|---|---|---|
| 21 | Robson Poffo | 620 Fayetteville Street | 2020-03-27T12:21:10Z |
| 21 | Robson Poffo | 620 Fayetteville Street #620, Raleigh, NC | 2020-03-28T16:48:19Z |

Consolidation: The consolidation process removes old versions of the data and keeps only the most recent version. In the previous example, consolidating the data gives the following result:

| code | name | address | creationDate |
|---|---|---|---|
| 21 | Robson Poffo | 620 Fayetteville Street #620, Raleigh, NC | 2020-03-28T16:48:19Z |

The consolidation considers the identifier defined in the staging table or the merge rule defined in the data model.
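As a rough illustration of the two concepts above, the following sketch stores every version of a record and then consolidates by keeping only the newest version per identifier. It is plain Python and not Carol's implementation: the function name `consolidate`, the use of `code` as the identifier, and the use of `creationDate` to order versions are assumptions made for this example.

```python
# Illustrative sketch only; this is not how Carol implements consolidation.
# Assumptions: "code" is the identifier and "creationDate" orders the versions.

def consolidate(records, key="code", version_field="creationDate"):
    """Keep only the most recent version of each record, grouped by `key`."""
    latest = {}
    for record in records:
        current = latest.get(record[key])
        # ISO 8601 UTC timestamps in the same format sort correctly as strings.
        if current is None or record[version_field] > current[version_field]:
            latest[record[key]] = record
    return list(latest.values())


# Immutable storage: both versions of record 21 are kept.
versions = [
    {"code": 21, "name": "Robson Poffo",
     "address": "620 Fayetteville Street",
     "creationDate": "2020-03-27T12:21:10Z"},
    {"code": 21, "name": "Robson Poffo",
     "address": "620 Fayetteville Street #620, Raleigh, NC",
     "creationDate": "2020-03-28T16:48:19Z"},
]

consolidated = consolidate(versions)
print(len(versions), "->", len(consolidated))
print(consolidated[0]["address"])
```

Grouping by a single key mirrors the staging table identifier; a merge rule in a data model could be modeled by computing the key from several fields instead.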

Real-Time: The real-time storage type is optimized for search and queries. It is normally used to develop applications that require real-time queries. The data structure is the same whether you use the CDS storage type or the Real-time storage type.

Connector

The data is stored inside staging tables, and the staging tables belong to connectors. This is the way Carol organizes the data.

The following image shows the connector "Clock-In Mobile" and its staging tables:

[Image: connector "Clock-In Mobile" and its staging tables]

Understanding the information:

  • Staging Table: The name of the staging table. If the staging table is synced from a database (normally using Carol Connect), this is the name of the table in the database.

  • Data Model: The data structure defined inside Carol, in the Data Model module. A Data Model has the same goal as a table, but with more features enabled by the Carol platform. The symbol "RT" stands for "Real-Time".

  • ETL: Indicates that ETL operations are applied to this staging table. ETL stands for Extract, Transform and Load, operations used to modify the data structure.

  • Mapping Status: Indicates whether the mapping to the Data Model is published, is a draft version, or is not defined. The mapping defines how the pipeline processes your data, transforms it using the cleanse rules, and saves it inside a data model.

  • Last Consolidated: The date and time the consolidation process last ran for the data in this staging table. You can schedule the consolidation to run frequently. For more about consolidation, see the definition of the terms: https://docs.carol.ai/docs/cds-entry-point#section-terms-used

  • Total Records: The total number of records this staging table has received. Because records are immutable, this number includes every stored version of a record, so it depends on when you last ran the consolidation.

  • Pending Records: The number of records received since the last consolidation, if you are consolidating the records for this staging table. This column treats the data as immutable, so it counts all versions of the same record.

  • Last Record Processed: The last date and time a record was processed to create the golden record.

When you click a staging table row, more options appear:

[Image: staging table options]

The same information available on the previous screen is visible here.

The actions we will explore are "View Sample Data", "Advanced", and "Process".

When you click "View Sample Data", the following screen opens:

[Image: "View Sample Data" screen]

This screen lets the user explore sample data. It does not show all the data the connector/staging table received. The percentage of data kept as sample data is covered in the next steps.

When you click "Advanced", the following screen opens:

[Image: "Advanced" screen]

In this screen we can perform the following configuration:

| Configuration | Description |
|---|---|
| Retention Field | Defines the field used by the retention policy. The retention policy purges data automatically based on the age of a Date or Timestamp field. For example, you can purge records that are older than five years according to a specific field. The retention policy is applied when the consolidation runs. |
| Percentage of Sample Data | The probability that each staging record is selected as sample data. The default and recommended value is 5%. |
| Consolidation Schedule for CDS | Runs the consolidation automatically. The schedule can run several times a day, on a few days of the week, or on a few days of the month. |
| Consolidate now | Starts the consolidation immediately. A task is created, and you can follow its progress in Activity Management. |
| Rebuild Sample | Starts a process that rebuilds the sample data. This is important when you change the percentage of sample data: the process deletes the old sample and selects a new one. Incoming data has the same rule applied, so the sample keeps being updated according to the configured probability. Because the selection is probabilistic, rebuilding the sample data may produce different results. |
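To make the probability-based sampling more concrete, here is a minimal sketch in plain Python. It assumes each record is selected independently with the configured percentage as its probability; the function `build_sample` and its parameters are hypothetical and do not correspond to a Carol API.

```python
import random

# Illustrative sketch only; `build_sample` is a hypothetical helper,
# not part of Carol or pyCarol.

def build_sample(records, percentage=5.0, seed=None):
    """Select each record independently with probability percentage / 100."""
    if not 0 <= percentage <= 100:
        raise ValueError("percentage must be between 0 and 100")
    rng = random.Random(seed)
    probability = percentage / 100.0
    return [record for record in records if rng.random() < probability]


staging_records = list(range(10_000))

# Two rebuilds with different random states select different records,
# and the sample size is only approximately 5% of the total.
first = build_sample(staging_records, percentage=5, seed=1)
second = build_sample(staging_records, percentage=5, seed=2)
print(len(first), len(second))
```

Because every record is drawn independently, the sample size fluctuates around the configured percentage instead of matching it exactly, which is consistent with a rebuild producing different results.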

When you click "Process", the following options are available:

[Image: "Process" options]

Each option means the following:

| Operation | Description |
|---|---|
| Pause all / Play all | Depending on the current state, this action is shown as "Pause all" or "Play all". It pauses the data processing or starts it. |
| Clean & Reprocess 12 Records | Cleans the target, which is normally a data model, or another staging table when you have an ETL. After that, it reprocesses all records in the staging table (12 records in this scenario). |
| Reprocess 12 Records | Reprocesses the 12 records without cleaning the target. |

🚧

Delay to update the staging indicators

It is normal for the data to take a few seconds to show up in the Staging Area and later as a Golden Record.

This can take up to 1 minute.

Data Model

The data model shows the number of records available (label "Golden Records"). This information is displayed as a single number when the data model has a single storage type.

When the data model has two storage types (as in the following image), the number of Golden Records is displayed in the format "14 (out of 56)", indicating that the data model has 14 golden records in real-time storage and a total of 56 records in CDS storage.

📘

Remember CDS is immutable

Remember that CDS is immutable. This is why the CDS storage type has a total of 56 records while the real-time storage type has only 14. It means each record in this data model has, on average, 4 versions stored in CDS (56 / 14).
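The arithmetic behind these numbers can be sketched as follows. The helper `parse_golden_records` is hypothetical and only reads the label format quoted in this section; the result is the average number of stored versions per golden record, that is, the initial version plus any later updates.

```python
import re

# Illustrative sketch only; `parse_golden_records` is a hypothetical helper
# that reads the "N (out of M)" label format shown in this section.

def parse_golden_records(label):
    """Return (real-time count, CDS count) from a label like '14 (out of 56)'."""
    match = re.fullmatch(r"\s*(\d+)\s*\(out of\s+(\d+)\)\s*", label)
    if match is None:
        raise ValueError(f"unexpected label: {label!r}")
    return int(match.group(1)), int(match.group(2))


realtime_count, cds_count = parse_golden_records("14 (out of 56)")

# CDS keeps every version, so the ratio is the average versions per record.
average_versions = cds_count / realtime_count
print(realtime_count, cds_count, average_versions)
```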

[Image: data model showing Golden Records as "14 (out of 56)"]

You can manage the storage types by clicking the "Options" button and then "Data Management":

[Image: "Options" button and "Data Management" entry]

The following screen will show up:

[Image: "Data Management" screen]

These are the features available in this screen:

| Feature | Description |
|---|---|
| Retention Policy | Follows the same idea as described for a staging table: it is the configuration used to purge data from this data model. The configuration depends on a date or timestamp field, and the consolidation must run for the retention policy to be applied. |
| Consolidate | Lets the user define the schedule to consolidate the data automatically. Refer to the terms definition to review the concepts of immutable data and consolidation. |
| Consolidate now | Triggers a manual consolidation. |
| Add Realtime Storage | Lets the user add the Real-time storage type. |

📘

CDS storage type

The CDS storage type is the default and is mandatory. It cannot be removed, so it is always available.

The following screen shows how to add the Real-time storage type. This storage type can have its own retention policy, so you can define two different retention policies. The CDS retention period must always be higher than the Real-time one.

[Image: screen to add the Real-time storage type]
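The relationship between the two retention policies can be sketched in plain Python. Everything here is an assumption for illustration: the function names, expressing retention in days, and treating "higher" as strictly greater (the text does not say whether equal values are accepted). It is not Carol's implementation.

```python
from datetime import datetime, timedelta, timezone

# Illustrative sketch only; names and units (days) are assumptions.

def check_retention(cds_days, realtime_days):
    """Reject settings where CDS does not keep data longer than Real-time."""
    if cds_days <= realtime_days:
        raise ValueError("CDS retention must be higher than Real-time retention")


def apply_retention(records, field, keep_days, now):
    """Drop records whose `field` timestamp is older than `keep_days` days."""
    cutoff = now - timedelta(days=keep_days)
    return [record for record in records if record[field] >= cutoff]


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
records = [
    {"code": 1, "creationDate": datetime(2018, 6, 1, tzinfo=timezone.utc)},
    {"code": 2, "creationDate": datetime(2022, 6, 1, tzinfo=timezone.utc)},
    {"code": 3, "creationDate": datetime(2024, 12, 1, tzinfo=timezone.utc)},
]

cds_days, realtime_days = 5 * 365, 90
check_retention(cds_days, realtime_days)

# The purge is based on a date field; CDS keeps about five years of data,
# while the real-time layer keeps only the most recent 90 days.
in_cds = apply_retention(records, "creationDate", cds_days, now)
in_realtime = apply_retention(in_cds, "creationDate", realtime_days, now)
print([r["code"] for r in in_cds], [r["code"] for r in in_realtime])
```

With this ordering the real-time layer always holds a subset of what CDS holds, which is what allows the real-time data to be rebuilt from CDS.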

📘

Remove Storage Type

You can remove the Real-time storage type and add it again later. Removing it makes Carol delete the data from the real-time layer. Once you add it again, Carol syncs the data from CDS to Real-time, respecting the retention policy defined for the real-time storage type.

Explore

The Explore module shows the data models, each with a label indicating whether the data model has the real-time storage type. The label means that you can explore the data in the Explore module.

The number of records follows the same idea as in the data model maintenance screen. The number "20 (out of 70)" indicates that you have 20 records in the real-time storage type and a total of 70 records in the CDS storage type.

The difference in the records total is normal, since the CDS is immutable.

The following screen is shown when you try to access a data model that does not have the real-time storage type. You can enable the real-time storage type and define a specific retention policy for it, or you can use a Carol App to summarize the data and make it ready for the real-time layer.

[Image: data model without the real-time storage type]

When navigating the golden records, you can filter the records in several different ways. You can click a golden record to see how it was built, explore its relationships in a 360° view, and much more.

pyCarol: Carol App & Jupyter Lab

Through the pyCarol library you can access any data available in Carol, either from a Carol App or from Jupyter Lab.

The data can be available in the real-time storage type or CDS.

The menu on the right lets us generate code snippets that make it easy to retrieve data from Carol:

[Image: menu for generating code snippets]