
Handbook of Big Data Analytics : Volume 1: Methodologies / edited by Vadlamani Ravi and Aswani Kumar Cherukuri

Contributor(s):
Material type: Text
Publication details: London, United Kingdom : The Institution of Engineering and Technology, 2021
Edition: First Ed.
Description: xxx, 359p. : ill. ; 24cm; Volume-1
ISBN:
  • 9781839530647
DDC classification:
  •  005.7 RAV
Holdings:
Item type: Reference Book | Current library: VIT-AP (General Stacks) | Call number: 005.7 RAV | Status: Not for loan | Notes: CSE | Barcode: 021083

About the Book:
Big Data analytics is the complex process of examining big data to uncover information such as correlations, hidden patterns, trends, and user and customer preferences, allowing organizations and businesses to make more informed decisions. These methods and technologies have become ubiquitous in all fields of science, engineering, business and management due to the rise of data-driven models as well as data engineering developments using parallel and distributed computational analytics frameworks, data and algorithm parallelization, and GPGPU programming. However, potential issues remain that need to be addressed to enable big data processing and analytics in real time. In the first volume of this comprehensive two-volume handbook, the authors present several methodologies to support Big Data analytics, including database management, processing frameworks and architectures, data lakes, query optimization strategies, real-time data processing, data stream analytics, fog and edge computing, and artificial intelligence for Big Data. The second volume is dedicated to a wide range of applications in secure data storage, privacy preservation, Software Defined Networks (SDN), the Internet of Things (IoT), behaviour analytics, traffic prediction, gender-based classification of e-commerce data, recommender systems, Big Data regression with Apache Spark, visual sentiment analysis, wavelet neural networks via GPU, stock market movement prediction, and financial reporting. The two-volume work aims to provide a unique platform for researchers, engineers, developers, educators and advanced students in the field of Big Data analytics.

The volume also includes sections about the editors and contributors, a foreword, a preface, acknowledgements, and an index.

Table of Contents:

Front Matter
1 The impact of Big Data on databases
p. 1–36 (36)

From the point of view of information management, the last decade has been characterized by an exponential generation of data. Data is generated in any interaction carried out by digital means. Popular examples include social networks on the Internet, mobile device apps, commercial transactions through online banking, a user's browsing history, and geolocation information generated by a user's mobile device. In general, all of this information is stored by the companies or institutions with which the interaction is maintained (unless the user has expressly indicated that it cannot be stored).
2 Big data processing frameworks and architectures: a survey
p. 37–104 (68)

In recent times, there has been rapid growth in data generated from autonomous sources. Existing data processing techniques are not suitable for these large volumes of complex data, which can be structured, semi-structured or unstructured. Such data is referred to as Big data because of its main characteristics: volume, variety, velocity, value and veracity. Extensive research on Big data is ongoing, and its primary focus is on processing massive amounts of data effectively and efficiently. However, researchers are paying little attention to how to store and analyze these large volumes of data to obtain useful insights. In this chapter, the authors examine existing Big data processing frameworks such as MapReduce, Apache Spark, Storm and Flink; the architectures of MapReduce, iterative MapReduce frameworks, and the components of Apache Spark are discussed in detail. Most of the widely used classical machine learning techniques are implemented on these Big data frameworks in the Apache Mahout and Spark MLlib libraries, and these need to be enhanced to support further machine learning techniques such as formal concept analysis (FCA) and neural embeddings. Taking FCA as an application, the authors provide scalable FCA algorithms built on MapReduce and Spark. Streaming data processing frameworks such as Apache Flink and Apache Storm are also examined. The authors also discuss storage architectures such as the Hadoop Distributed File System (HDFS), Dynamo and Amazon S3 in the context of large Big data applications. The survey concludes with a proposal for best practices related to the studied architectures and frameworks.
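The map/shuffle/reduce model that these frameworks implement can be illustrated with a toy, single-process word count in plain Python (a sketch of the programming model only, not the Hadoop or Spark API):

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit (word, 1) pairs from each input document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Shuffle: group all values by key, as the framework would do
    over the network between the map and reduce stages."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the grouped values per key."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data needs big frameworks", "big data frameworks scale"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts["big"])   # 3
print(counts["data"])  # 2
```

Hadoop and Spark run the same three phases, but distribute the map and reduce tasks across a cluster and perform the shuffle over the network.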
3 The role of data lake in big data analytics: recent developments and challenges
p. 105–123 (19)

We explore the concepts of a data lake (DL) and big data fabric, DL architecture, and the various layers of a DL, and present the components of each layer. We compare and contrast data warehouses and DLs with respect to some key characteristics. Moreover, we explore various commercial and open-source DLs with their strengths and limitations, and discuss some key best practices for DLs. Further, we present two case studies of DLs: the Lumada data lake (LDL) and the Temenos data lake (TDL) for digital banking. Finally, we explore some of the crucial challenges faced in the formation of DLs.
4 Query optimization strategies for big data
p. 125–155 (31)

Query optimization for big data architectures such as MapReduce, Spark, and Druid is challenging due to the number of algorithmic issues to be addressed. Conventional algorithmic design issues such as memory, CPU time and I/O cost should be analyzed in the context of additional parameters such as communication cost. The issue of data-resident skew further complicates the analysis. This chapter studies communication cost reduction strategies for conventional workloads such as joins, spatial queries, and graph queries. We review algorithms for multi-way join using MapReduce. Multi-way θ-join algorithms address the multi-way join with inequality conditions. As θ-join output is much larger than that of an equi-join, multi-way θ-join poses further difficulties for the analysis. An analysis of multi-way θ-join is presented on the basis of the sizes of the input and output sets as well as the communication cost. Data-resident skew plays a key role in all the scenarios discussed, and addressing skew in a general sense is covered. Partitioning strategies that minimize the impact of data skew on the load skew across computing nodes are also presented. The application of join strategies to spatial data has drawn the interest of researchers, and the distribution of spatial joins requires special emphasis to deal with the spatial nature of the dataset. A controlled-replicate strategy is reviewed to solve the problem of multi-way spatial join. Graph-based analytical queries such as triangle counting and subgraph enumeration are presented in the context of distributed processing. Being a primitive needed for many graph queries, triangle counting is analyzed from the perspective of the skew it brings, using an elegant distribution scheme. The subgraph enumeration problem is also presented using various partitioning schemes, with a brief analysis of their performance.
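Triangle counting, mentioned above as a primitive for many graph queries, reduces to checking which pairs of a node's neighbours are themselves adjacent. A minimal single-machine sketch (a toy illustration, not the distributed scheme the chapter analyzes):

```python
from itertools import combinations

def count_triangles(edges):
    """Count triangles in an undirected graph given as an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    triangles = 0
    for u in adj:
        # only consider neighbours "greater" than u, so each triangle
        # is counted exactly once (distributed schemes use a similar
        # ordering to avoid duplicate counting across partitions)
        higher = sorted(w for w in adj[u] if w > u)
        for v, w in combinations(higher, 2):
            if w in adj[v]:
                triangles += 1
    return triangles

# K4 (the complete graph on 4 nodes) contains exactly 4 triangles
edges = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
print(count_triangles(edges))  # 4
```

The skew the chapter discusses arises because high-degree nodes contribute quadratically many neighbour pairs, overloading whichever node or partition processes them.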
5 Toward real-time data processing: an advanced approach in big data analytics
p. 157–174 (18)

Nowadays, a huge quantity of data is produced by multiple data sources. Existing tools and techniques are not capable of handling such voluminous data produced from a variety of sources. This continuous and varied generation of data requires advanced technologies for processing and storage, which poses a big challenge for data scientists. Some research studies are well defined in the area of streaming in big data. Streaming data is real-time data, or data in motion, such as stock market data, sensor data, GPS data and Twitter data. In stream processing, the data is not stored in databases; instead, it is processed and analyzed on the fly to extract value as soon as it is generated. A number of streaming frameworks have been proposed to date for big data applications, used to collect, evaluate and process data that is generated and captured continuously. In this chapter, we provide an in-depth summary of various big data streaming approaches such as Apache Storm, Apache Hive and Apache Samza, and present a comparative study of these streaming platforms.
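The "process on the fly, don't store" idea can be sketched as a tumbling-window aggregation over a Python generator (a toy illustration of the concept, not any particular framework's API):

```python
def tumbling_window_avg(stream, window_size):
    """Process readings on the fly: emit the average of each
    fixed-size window without ever storing the whole stream."""
    window = []
    for value in stream:
        window.append(value)
        if len(window) == window_size:
            yield sum(window) / window_size
            window = []  # discard processed data, as stream processors do

# simulated sensor stream; in practice values arrive continuously
readings = iter([10, 20, 30, 40, 50, 60])
print(list(tumbling_window_avg(readings, 3)))  # [20.0, 50.0]
```

Frameworks such as Storm and Samza apply the same principle, but run such operators in parallel across a cluster with fault tolerance and backpressure handling.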
6 A survey on data stream analytics
p. 175–208 (34)

With the exponential expansion of the interconnected world, data of large volume, variety and velocity flows through our systems. The dependencies on these systems have crossed the threshold of business value, and such communications have started to be classified as essential systems. As such, these systems have become vital social infrastructure that needs prediction, monitoring, safeguards and immediate decision-making in case of threats. The key enabler is data stream analytics (DSA). In DSA, the key areas of stream processing are prediction and forecasting; classification; clustering; mining frequent patterns and finding frequent item sets (FISs); detecting concept drift; building synopsis structures to answer standing and ad hoc queries; sampling and load shedding in the case of bursts of data; and processing data streams emanating from the very large number of interconnected devices typical of the Internet of Things (IoT). The processing complexity is affected by the multidimensionality of the stream data objects, by building `forgetting' in as a key construct of the processing, by leveraging the time-series aspect to aid the processing, and so on. In this chapter, we explore some of the aforementioned areas and provide a survey of each of these selected areas. We also provide a survey of the data stream processing systems (DSPSs) and frameworks being adopted by industry at large.
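Finding frequent items under tight memory, one of the DSA areas listed above, is classically handled by the Misra-Gries summary, which "forgets" counts rather than storing the whole stream. A compact sketch (one standard technique for this problem; the chapter surveys several):

```python
def misra_gries(stream, k):
    """Misra-Gries summary: keeps at most k-1 counters, yet guarantees
    that every item occurring more than n/k times in a stream of n
    items survives in the summary."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # no room: "forget" a little of every counter instead
            # of storing the new item
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

stream = ["a", "b", "a", "c", "a", "b", "a", "d", "a"]
summary = misra_gries(stream, k=3)
print("a" in summary)  # True: 'a' occurs 5 of 9 times, well over n/k
```

The stored counts are lower bounds on the true frequencies; a second pass (or a tolerance for false positives) is needed to confirm exact heavy hitters.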
7 Architectures of big data analytics: scaling out data mining algorithms using Hadoop–MapReduce and Spark
p. 209–296 (88)

Many statistical and machine learning (ML) techniques have been successfully applied to small-sized datasets over the past decade and a half. However, in today's world, different application domains, viz., healthcare, finance, bioinformatics, telecommunications, and meteorology, generate huge volumes of data on a daily basis, and all these massive datasets have to be analyzed to discover hidden insights. With the advent of the big data analytics (BDA) paradigm, data mining (DM) techniques were modified and scaled out to adapt to distributed and parallel environments. This chapter reviews 249 articles that appeared between 2009 and 2019 and implemented different DM techniques in a parallel, distributed manner in the Apache Hadoop MapReduce framework or the Apache Spark environment for solving various DM tasks. We present some critical analyses of these papers and bring out some interesting insights. We find that methods such as Apriori, support vector machines (SVM), random forests (RF), K-means and many variants of these, along with many other approaches, have been ported to parallel distributed environments and have produced scalable and effective results. The review concludes with a discussion of some open areas of research with future directions, which can be explored by researchers and practitioners alike.
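How such DM algorithms are scaled out can be illustrated with one K-means iteration written in map/reduce style, in plain Python (a single-process sketch of the pattern, not the Spark MLlib implementation, which distributes the same two phases across partitions):

```python
from math import dist  # Euclidean distance, Python 3.8+

def kmeans_step(points, centroids):
    """One K-means iteration in map/reduce style: the 'map' assigns
    each point to its nearest centroid, the 'reduce' averages the
    points of each cluster into a new centroid."""
    # map: produce (nearest-centroid index, point) pairs
    assignments = [
        (min(range(len(centroids)), key=lambda i: dist(p, centroids[i])), p)
        for p in points
    ]
    # reduce: recompute each centroid as the mean of its members
    new_centroids = []
    for i in range(len(centroids)):
        members = [p for c, p in assignments if c == i]
        new_centroids.append(
            tuple(sum(x) / len(members) for x in zip(*members))
            if members else centroids[i]  # keep empty clusters unchanged
        )
    return new_centroids

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(kmeans_step(points, [(0, 0), (10, 10)]))  # [(0.0, 0.5), (10.0, 10.5)]
```

Iterating this step until the centroids stop moving gives the full algorithm; the distributed versions reviewed in the chapter shuffle only the per-cluster sums and counts between iterations, not the raw points.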
8 A review of fog and edge computing with big data analytics
p. 297–316 (20)

In this review, we present and explore the cloud computing offloading strategies with fog and edge computing that have gained acceptance in recent years. They reflect a noticeable improvement in the collection, transmission and management of data in the field for computing consumers. This review also focuses on how various computing paradigms applied in fog and edge computing environments are used for realising recently emerging IoT applications and for addressing cyber-security threats.
9 Fog computing framework for Big Data processing using cluster management in a resource-constraint environment
p. 317–334 (18)

This article presents the implementation details of the distributed storage and processing of big datasets in a fog computing cluster environment. The implementation of a fog computing framework using Apache Spark for big data applications in a resource-constrained environment is described. Results related to Big Data processing, modeling, and prediction in the resource-constrained fog computing framework are presented through case studies evaluated on an e-commerce customer dataset and bank loan credit risk datasets.
10 Role of artificial intelligence and big data in accelerating accessibility for persons with disabilities
p. 335–343 (9)

Artificial intelligence (AI) and big data have moved from being niche tools to mainstream tools in the recent past. These technological improvements have changed the manner in which software tools are designed and have provided unprecedented benefits to users. This article analyses the impact of both technologies through the lens of accessibility computing, a sub-domain of human-computer interaction. The rationales for incorporating accessibility for persons with disabilities into the digital ecosystem are illustrated. The article proposes the key term `perception porting', which refers to converting data suited to one sense for delivery through another with the help of AI and big data. The specific tools and techniques available to assist persons with specific disabilities, such as smart vision, smart exoskeletons, captioning techniques and Internet of Things-based solutions, are explored.
Back Matter

Powered by Koha