Paper SAS390-2014

Washing the Elephant: Cleansing Big Data Without Getting Trampled

Mike Frost, SAS Institute


Data quality is at the very heart of accurate, relevant, and trusted information, but traditional techniques that require the data to be moved, cleansed, and repopulated simply can't scale up to cover the ultra-jumbo nature of big data environments. This paper describes how SAS® Data Quality Accelerators for databases like Teradata and Hadoop deliver data quality for big data by operating in situ and in parallel on each of the nodes of these clustered environments. The paper shows how data quality operations can be easily modified to leverage these technologies. It examines the results of performance benchmarks that show how in-database operations can scale to meet the demands of any use case, no matter how big a big data mammoth you have.


As with many technology trends, data quality started out as an unsolved problem that needed a solution before it became a fundamental element of what are now modern data management practices. As the need emerged for a way to combine data from various sources around an organization’s IT infrastructure, the data warehouse came into being. “Aggregate all of your data into one place,” came the advice, usually from data warehouse vendors themselves. “Once it’s there, you can store it, report on it, or analyze it and everything will be great.” What organizations quickly began to realize, however, is that while data formatting issues within a data warehouse could be resolved by data integration technologies such as Extract, Transform, and Load (ETL) tools, issues like nonstandard data formats, parsed vs. unparsed information, and duplicate information persisted, and these required a new technology.


What emerged was the first set of data quality tools, components that were designed to address data quality issues within the client-server architectures that were the predominant processing environment at the time. In these environments, server platform resources that managed data warehouses were at a premium because they required server hardware that was significantly more specialized and therefore expensive to acquire, support, and maintain. As a result, data quality tools were deployed and run on client platforms or more modest server platforms located across the network from the data warehouse that held the data needing cleansing.

To clean the data, the data quality tool would establish a connection with the data warehouse and execute a query that extracted the data set to be cleansed and sent it across the network to the data quality tool, which would perform whatever transformations were necessary to clean the data. The new, cleansed results would then typically be written back to the data warehouse for use. This process of extracting, cleansing, and publishing cleansed results was so common that it has become entrenched as the de facto standard for how virtually all data quality processing is handled today. While this approach is adequate for many use cases, it does not hold up in all circumstances. In particular, as big data architectures emerged, the limitations of the traditional approach to data quality processing became obvious.
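The extract, cleanse, and publish cycle described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for the data warehouse, and the `customers` table, the `standardize_name` rule, and the function names are all hypothetical rather than part of any SAS product.

```python
import sqlite3


def standardize_name(raw: str) -> str:
    """Toy cleansing rule (hypothetical): collapse whitespace and title-case."""
    return " ".join(raw.split()).title()


def extract_cleanse_publish(conn: sqlite3.Connection) -> int:
    """Run the traditional three-phase flow; return the number of rows handled."""
    # 1. Extract: pull every row out of the "warehouse" to the client.
    rows = conn.execute("SELECT id, name FROM customers").fetchall()
    # 2. Cleanse: transform the data on the client side.
    cleansed = [(standardize_name(name), row_id) for row_id, name in rows]
    # 3. Publish: write the cleansed values back to the warehouse.
    conn.executemany("UPDATE customers SET name = ? WHERE id = ?", cleansed)
    conn.commit()
    return len(cleansed)


def demo() -> list:
    """Build a tiny stand-in warehouse, cleanse it, and return the final rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO customers (id, name) VALUES (?, ?)",
        [(1, "  mike   FROST "), (2, "JANE doe")],
    )
    extract_cleanse_publish(conn)
    return conn.execute("SELECT id, name FROM customers ORDER BY id").fetchall()
```

Every row crosses the connection twice, once on the way out and once on the way back, which is the cost that grows with the size of the table.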


Big data is a solution to the problem of how to store, manage, and efficiently process massive amounts of data. Architecture details among different big data implementations vary, but most leverage a distributed, parallel processing architecture that can easily be expanded to scale up and meet increased capacity demands. By combining improvements and advancements in hardware and software with a better design for handling very large amounts of data, many of the promises made about big data are now starting to sound a lot like what was said about data warehouses: “Get everything, any kind of data, even data that you don’t yet think you will need, into a big data environment. Once it’s there, you can store it, report on it, or analyze it and everything will be great.” However, just like with data warehouses, the sheer size and scope of data managed within typical big data environments are creating a data quality problem that requires new technology to solve.


The difference in orders of magnitude between the size and scale of a big data environment as compared with a data warehouse can be stark. One large financial services company has reported having a single table that consisted of over 60 billion records that they would like to perform data quality processing against multiple times a day. While large tables of this size aren’t rare, frequently running a data quality process against tables of this size is.
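A back-of-the-envelope calculation gives a feel for that scale. The sketch below assumes that a single-stream cleansing job scales linearly with row count, and the baseline of 1 billion rows in 6 hours is a purely hypothetical figure chosen for illustration; only the 60 billion row count comes from the example above.

```python
def estimate_hours(baseline_rows: int, baseline_hours: float, target_rows: int) -> float:
    """Estimate run time for target_rows, assuming time grows linearly with rows."""
    return baseline_hours * (target_rows / baseline_rows)


# Hypothetical baseline: an overnight job that cleanses 1 billion rows in 6 hours.
hours = estimate_hours(1_000_000_000, 6.0, 60_000_000_000)
days = hours / 24
print(f"{hours:.0f} hours, or about {days:.0f} days")
```

Under these assumed numbers a single pass would take roughly two weeks, which rules out running the process several times a day.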


Simply put, traditional data quality processing is too slow to be practical for big data environments. In some cases, organizations find that overnight batch data quality processes that took hours to complete for their data warehouse might take multiple weeks to complete when run against big data environments, if they complete at all. Organizations take different approaches to working around this limitation, including:

- Performing data quality on a fraction of the overall data set
- Scheduling data quality processing as infrequently as possible, for example on a monthly, quarterly, or yearly basis instead of overnight
- Not performing data quality processing at all

The last choice is perhaps the most problematic for organizations, because it creates a problem that is difficult to overcome once it has become established: a lack of trust in the data. No matter what size of data is used, data quality problems eventually lead to a lack of faith in the decisions that are based on analysis of the data. For this reason, organizations are coming to grips with the fundamental question of how to make sure the data in their big data environments is accurate and can be trusted without negatively impacting fundamental business processes or compromising on the scale of the data managed by their environment.


The solution to the problem lies in taking a new approach to data quality processing rather than one that depends upon the traditional approach of extracting, cleansing, and republishing the results. For big data, the most logical approach to take is one that leverages one of the key strengths of the big data environment itself – the distributed processing architecture that can execute work in parallel and that can scale directly with the size of the data.
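The shape of this approach can be sketched as follows: each node cleanses only the rows it already holds, and the work for all nodes is submitted at once. This is a toy model under stated assumptions. The `cluster` dictionary, the node names, and the `standardize_name` rule are hypothetical, and a Python thread pool merely stands in for the platform's own parallel execution.

```python
from concurrent.futures import ThreadPoolExecutor


def standardize_name(raw: str) -> str:
    """Toy cleansing rule (hypothetical): collapse whitespace and title-case."""
    return " ".join(raw.split()).title()


def cleanse_partition(rows: list) -> list:
    """Cleanse the rows held by one node without moving them elsewhere."""
    return [standardize_name(value) for value in rows]


def cleanse_in_place(cluster: dict) -> dict:
    """Cleanse every node's partition concurrently.

    `cluster` maps a node name to the list of rows stored on that node.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(cluster))) as pool:
        futures = {
            node: pool.submit(cleanse_partition, rows)
            for node, rows in cluster.items()
        }
        return {node: future.result() for node, future in futures.items()}


cluster = {"node1": [" ann  lee"], "node2": ["BOB ray", "cy  DOE"]}
result = cleanse_in_place(cluster)
```

Because each partition is handled where it lives, adding nodes adds processing capacity along with storage, which is the property the traditional extract-based flow lacks.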

One way organizations try to accomplish this is to implement data quality algorithms that use the native processing syntax or logic of the big data environment. Organizations that investigate this approach quickly discover that it is not practical, because big data programmers often lack the kind of knowledge needed to write good, flexible data quality algorithms. In addition, if a change is needed to an algorithm, code must be changed, or multiple versions of the algorithm must be coded and maintained to account for variations in how data quality must be applied for different use cases. These limitations quickly spiral out of control, creating a problem that is almost as bad as the problems that this approach was designed to solve.


SAS has led the way in providing the industry’s best data quality solution for many years through products such as Data Quality Server, dfPower, and Data Management Platform, developed and sold through its former DataFlux subsidiary. With the absorption of DataFlux products and expertise into the SAS Data Management Division, these industry-leading data quality capabilities can more easily leverage the power of current and emerging SAS technologies. One such example of SAS technology that can augment SAS data quality capabilities is called the SAS Embedded Process.


The SAS Embedded Process deploys within a big data platform and runs on all of the nodes of the architecture. It acts as a kind of shell environment for running SAS code, specifically, DS2. When DS2 code is run within the Embedded Process, the Embedded Process manages and distributes the workload across all of the nodes of the big data platform, allowing the code to be executed in a parallel fashion. In this way, SAS offers an ability for other SAS technologies to leverage the power and scalability of a big data platform.

Delivering scalable SAS data quality capabilities to big data environments means developing DS2-based implementations of SAS data quality functions that can run within the Embedded Process, which is exactly what the SAS® Data Quality Accelerator is.

This new product consists of an implementation of SAS Data Quality functions in DS2 code that deploys within the Embedded Process. When licensed and configured, these functions can be invoked via interfaces such as user-defined functions or stored procedures. Calling these functions causes the DS2 code that corresponds to SAS Data Quality to load and use the SAS Quality Knowledge Base across all of the nodes of a big data environment to perform actions such as standardization of names, parsing of addresses, or entity extraction from text fields.

Here is a step-by-step breakdown of how the functions work within the Teradata platform:

1. A user connects to Teradata and issues a call to a special stored procedure defined on Teradata that corresponds to a SAS data quality function.

2. Teradata interprets the stored procedure call and passes the function call and related parameters to the SAS Embedded Process.

3. The Embedded Process loads the appropriate locale of the SAS® Quality Knowledge Base and invokes the data quality DS2 code that corresponds to the data quality function called across all nodes using the associated parameters provided as part of the stored procedure call.

4. Teradata flows data to the Embedded Process for processing. The Embedded Process processes data on each node in Teradata that has rows.

5. Results are passed by the Embedded Process to Teradata for persisting to a table appropriate to the parameters passed in by the user.
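The five steps above can be mimicked with a small in-process simulation. This is a sketch of the control flow only, not of the actual product: the class names, the `standardize_address` function, the table layout, and the contents of the `KNOWLEDGE_BASE` dictionary (including the `ENUSA` key) are all hypothetical stand-ins, and the nodes are visited in a simple loop where the real platform would process them in parallel.

```python
# Hypothetical stand-in for a locale-specific Quality Knowledge Base.
KNOWLEDGE_BASE = {
    "ENUSA": {"st": "Street", "ave": "Avenue", "rd": "Road"},
}


def standardize_address(value: str, rules: dict) -> str:
    """Expand known abbreviations and title-case everything else."""
    words = value.split()
    return " ".join(rules.get(w.lower().rstrip("."), w.title()) for w in words)


class EmbeddedProcess:
    """Toy model of a process that runs data quality code next to the data."""

    def __init__(self, knowledge_base: dict):
        self.knowledge_base = knowledge_base
        self.functions = {"standardize_address": standardize_address}

    def run(self, function_name: str, locale: str, node_rows: dict) -> dict:
        # Step 3: load the requested locale and look up the function to invoke.
        rules = self.knowledge_base[locale]
        function = self.functions[function_name]
        # Step 4: process the rows on each node, skipping nodes with no rows.
        results = {}
        for node, rows in node_rows.items():
            if rows:
                results[node] = [function(row, rules) for row in rows]
        return results


class Database:
    """Toy model of the database; each table maps node name -> list of rows."""

    def __init__(self, tables: dict, embedded_process: EmbeddedProcess):
        self.tables = tables
        self.embedded_process = embedded_process

    def call(self, procedure: str, locale: str, in_table: str, out_table: str) -> int:
        # Step 2: interpret the call and hand it to the embedded process.
        per_node = self.embedded_process.run(procedure, locale, self.tables[in_table])
        # Step 5: persist the results to the table named by the caller.
        self.tables[out_table] = per_node
        return sum(len(rows) for rows in per_node.values())


# Step 1: a user connects and issues the stored procedure call.
db = Database(
    {"addresses": {"node1": ["12 main st."], "node2": [], "node3": ["5 OAK ave"]}},
    EmbeddedProcess(KNOWLEDGE_BASE),
)
count = db.call("standardize_address", "ENUSA", "addresses", "addresses_clean")
```

Note that the cleansing rules live in the knowledge base rather than in the function body, so changing a rule in this sketch means editing data, not code.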

In the case of the Teradata platform, the SAS® Data Quality Accelerator can be invoked through any method by which a user or application can call a stored procedure. In some cases, users may wish to call the procedures using the native tools and utilities of the Teradata platform itself, either directly via a command-line interface or via a script, but SAS customers will likely wish to call them via data quality or data integration jobs.

Figure 1. Invoking the Data Quality Accelerator for Teradata from SAS® products

The above figure shows the mechanisms used by the DataFlux Data Management Studio / Server data job and by code generated by SAS® Data Integration Studio to invoke the Accelerator through a call to a stored procedure. Although there is no transform in Data Integration Studio that generates code for calling the specific stored procedures used by the SAS® Data Quality Accelerator, user-written code, such as PROC SQL with explicit pass-through, can be used and defined as a user-written transform. In this scenario, the libname definition and the user-written code would be stored in the SAS Metadata Server, which is also illustrated. In the case of Data Management Studio and Server, the SQL node can be used to pass through the syntax for calling the stored procedures directly as part of a job flow.


So what kinds of SAS data quality methodologies can be applied to big data architectures? The following is a list of data quality functions that are available today with the SAS® Data Quality Accelerator for Teradata and will be available for additional big data architectures, such as Hadoop, in the future.
