In the previous article, we discussed how to direct Spark logs to Kafka, but one issue with the org.apache.kafka.log4jappender.KafkaLog4jAppender class is that error stack traces don't get pushed into Kafka. Stack traces are vital and form a crucial part of log analysis.

In this article, we will discuss creating our own Kafka appender class (myKafkaAppender) by extending org.apache.kafka.log4jappender.KafkaLog4jAppender and customizing it to suit our needs. This class then needs to be specified in all log4j properties files.

Adding the error stack trace:-

To add the error stack trace, we will have to create our own subAppend method (we cannot override the subAppend method of the base class…
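A minimal sketch of what such a class could look like is below. The class name MyKafkaAppender and the enrichment strategy (wrapping the original LoggingEvent so that the base class sends a message that already contains the stack trace) are illustrative assumptions, not necessarily the article's exact implementation:

import org.apache.kafka.log4jappender.KafkaLog4jAppender;
import org.apache.log4j.Layout;
import org.apache.log4j.spi.LoggingEvent;

public class MyKafkaAppender extends KafkaLog4jAppender {

    // Our own subAppend: the base-class subAppend is private, so we re-create
    // the logic here and fold the stack trace into the message itself.
    private String subAppend(LoggingEvent event) {
        StringBuilder message = new StringBuilder(event.getRenderedMessage());
        String[] stackTrace = event.getThrowableStrRep(); // null if no throwable
        if (stackTrace != null) {
            for (String line : stackTrace) {
                message.append(Layout.LINE_SEP).append(line);
            }
        }
        return message.toString();
    }

    @Override
    protected void append(LoggingEvent event) {
        // Wrap the original event with the enriched message so that the base
        // class pushes the stack trace to Kafka along with the log message.
        LoggingEvent enriched = new LoggingEvent(
                event.getFQNOfLoggerClass(),
                event.getLogger(),
                event.getTimeStamp(),
                event.getLevel(),
                subAppend(event),
                null); // throwable is already folded into the message
        super.append(enriched);
    }
}

Once this class is on the classpath, it replaces the stock appender in log4j.properties, e.g. log4j.appender.KAFKA=com.example.MyKafkaAppender (package name assumed).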


Spark logs can be redirected to Kafka for further analysis. From Kafka, these logs can be fed to Elasticsearch and then to Kibana to get data- and resource-related insights about our job.

In this article, we will discuss how to stream log4j application logs to Apache Kafka using the Maven artifact kafka-log4j-appender.

We need to add kafka-log4j-appender as a dependency in pom.xml.

<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-log4j-appender</artifactId>
    <version>1.0.0</version>
</dependency>

Additionally, we need to add Kafka properties to the log4j.properties file to use KafkaLog4jAppender. The Kafka broker list and topic need to be specified. …
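A minimal sketch of such a configuration, assuming a local broker and a topic named spark-logs:

# Route root-logger output to both the console and Kafka
log4j.rootLogger=INFO, stdout, KAFKA

log4j.appender.KAFKA=org.apache.kafka.log4jappender.KafkaLog4jAppender
log4j.appender.KAFKA.brokerList=localhost:9092
log4j.appender.KAFKA.topic=spark-logs
log4j.appender.KAFKA.syncSend=false
log4j.appender.KAFKA.layout=org.apache.log4j.PatternLayout
log4j.appender.KAFKA.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %p %c{1}: %m%n

log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %p %c{1}: %m%n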


Spark Streaming is able to handle state-based operations, i.e. operations on a state that can be modified by subsequent batches of data. Stateful transformation is a property of Spark Streaming that enables us to maintain state between micro-batches. In other words, it helps maintain state across a period of time, which can be as long as an entire session of a streaming job, and thus allows us to perform sessionization of our data. This is achieved by enabling checkpointing on the streaming application.

Stateful Transformation In Spark

A simple use case of stateful transformation is tracking user activity. We may want…
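As a concrete illustration, here is a minimal sketch in Java using updateStateByKey to keep a running count of events per user across micro-batches. The socket source, port, and checkpoint directory are assumptions for the example; a real job would typically read the activity stream from Kafka:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.Optional;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class UserActivityCount {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("UserActivityCount");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));
        // Checkpointing is mandatory for stateful transformations.
        jssc.checkpoint("/tmp/spark-checkpoint"); // assumed path

        // Each record is a user id; assumed source for illustration only.
        JavaDStream<String> userIds = jssc.socketTextStream("localhost", 9999);
        JavaPairDStream<String, Integer> events =
                userIds.mapToPair(user -> new Tuple2<>(user, 1));

        // Merge this batch's values into the state carried over from
        // previous micro-batches, yielding a running count per user.
        JavaPairDStream<String, Integer> runningCounts = events.updateStateByKey(
                (values, state) -> {
                    int sum = state.isPresent() ? state.get() : 0;
                    for (Integer v : values) {
                        sum += v;
                    }
                    return Optional.of(sum);
                });

        runningCounts.print();
        jssc.start();
        jssc.awaitTermination();
    }
}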


Apache Kafka with Spark Structured Streaming is one of the best combinations for building real-time applications. In the previous article, we discussed the integration of Spark (2.4.x) with Kafka for batch processing of queries. In this article, we will discuss the integration of Spark Structured Streaming with Kafka.

Kafka:-
Kafka is a distributed publish/subscribe messaging system that acts as a pipeline for transferring real-time data in a fault-tolerant and parallel manner. Kafka acts as the central hub for real-time streams of data, which are processed using complex algorithms in Spark Streaming. Once the data is processed, Spark Streaming can…
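The shape of the integration can be sketched as follows. This is a minimal example, assuming the spark-sql-kafka-0-10 package is on the classpath; the broker address, topic name and checkpoint path are placeholders:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class KafkaStructuredStreaming {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("KafkaStructuredStreaming")
                .getOrCreate();

        // Subscribe to a Kafka topic as an unbounded streaming DataFrame.
        Dataset<Row> df = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker
                .option("subscribe", "logs")                         // assumed topic
                .load();

        // Kafka key/value arrive as binary; cast them to strings.
        Dataset<Row> messages = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)");

        // Continuously print each micro-batch to the console.
        StreamingQuery query = messages.writeStream()
                .outputMode("append")
                .format("console")
                .option("checkpointLocation", "/tmp/sss-checkpoint") // assumed path
                .start();
        query.awaitTermination();
    }
}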


Spark integration with kafka (Batch)

In this article, we will discuss the integration of Spark (2.4.x) with Kafka for batch processing of queries.

Kafka:-
Kafka is a distributed publish/subscribe messaging system that acts as a pipeline for transferring real-time data in a fault-tolerant and parallel manner. Kafka helps in building real-time streaming data pipelines that reliably move data between systems or applications. This data can be ingested and processed either continuously (Spark Structured Streaming) or in batches. In this article we will discuss ingestion of data from Kafka for batch processing using Spark. …
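As a minimal sketch of the batch path (broker address, topic name, and offset range are assumptions for the example):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class KafkaBatchRead {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("KafkaBatchRead")
                .getOrCreate();

        // A bounded read: everything between the given offsets is fetched
        // as a regular DataFrame and the job finishes.
        Dataset<Row> df = spark.read()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker
                .option("subscribe", "logs")                         // assumed topic
                .option("startingOffsets", "earliest")
                .option("endingOffsets", "latest")
                .load();

        df.selectExpr("CAST(value AS STRING)").show(false);
        spark.stop();
    }
}

Unlike the streaming read, spark.read() returns a static DataFrame, so no checkpoint location or query handle is needed.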

Aditya Pimparkar
