Candidate's Name Kotha
Contact - PHONE NUMBER AVAILABLE
Email - EMAIL AVAILABLE
LinkedIn - LINKEDIN LINK AVAILABLE
Senior Data Engineer

Background Summary:
10+ years of experience as a Data Engineer with a strong understanding of data modeling (relational and dimensional), data analysis, data warehousing implementations, data transformation, data mapping from source to target database schemas, and data cleansing.
Proficient in a range of Hadoop infrastructure components, including MapReduce, Pig, Hive, Zookeeper, HBase, Sqoop, Oozie, Flume, Drill, and Spark, for data storage and analysis.
Extensive experience importing and exporting data between HDFS and relational database management systems using Sqoop.
Extensive experience in IT data analytics projects; hands-on experience migrating on-premise ETLs to Google Cloud Platform (GCP) using cloud-native tools such as BigQuery, Cloud Dataproc, Google Cloud Storage, and Composer.
Designed and implemented ETL processes within Databricks to ingest, transform, and store data from various sources, ensuring data quality and reliability.
Familiar with Cloudera, Amazon Web Services (AWS), Microsoft Azure, and Hortonworks.
Strong background in designing and developing enterprise applications on the J2EE platform using Servlets, JSP, Struts, Spring, Hibernate, and web services.
Keen interest in the newer technology stack that Google Cloud Platform (GCP) continues to add.
Solid experience in Python, core Java, Scala, SQL, PL/SQL, and RESTful web services.
In-depth knowledge of various reporting objects in Tableau, including facts, attributes, hierarchies, transformations, filters, prompts, calculated fields, sets, groups, and parameters; experienced in using Flume and NiFi to load log files into Hadoop.
Experienced with GCP services: Cloud Storage, BigQuery, Composer, Cloud Dataproc, Cloud SQL, Cloud Functions, and Cloud Pub/Sub.
Developed custom Kafka producers and consumers for publishing and subscribing to Kafka topics.
Collected log data from multiple sources and integrated it into HDFS using Flume.
Strong understanding of NoSQL databases and hands-on experience with applications built on NoSQL databases such as Cassandra and MongoDB.
Experienced in developing custom UDFs for Pig and Hive, incorporating Python/Java functionality into Pig Latin and HiveQL, and utilizing UDFs from the Piggybank repository.
Proficient in executing queries using Impala and employing BI tools to run ad-hoc queries directly on Hadoop.
Able to work on GCP and Azure clouds in parallel.
Integrated Databricks with cloud storage solutions (e.g., AWS S3, Azure Blob Storage) for scalable data storage and retrieval.
Good working experience with Spark (including Spark Streaming and Spark SQL) using Scala and Kafka; worked with multiple data formats on HDFS using Scala.
Implemented various analytics algorithms with Cassandra using Spark and Scala.
Developed Vizboards for data visualization in Platfora for real-time dashboards on Hadoop.
Experienced with the Oozie framework and automating daily import jobs.
Proficient in troubleshooting errors in HBase Shell/API, Pig, Hive, and MapReduce.
Created machine learning models using Python and scikit-learn.
Innovative in developing elegant solutions to pipeline engineering challenges.
Solid understanding of querying data from Cassandra for searching, grouping, and sorting.
Experienced in managing Hadoop clusters and services using Cloudera Manager.
Good knowledge of Spark architecture with Databricks and Structured Streaming; set up AWS and Microsoft Azure with Databricks.
Designed and implemented a product search service utilizing Apache Solr.
Familiar with Azure Big Data technologies, including Azure Data Lake Analytics, Azure Data Lake Store, Azure Data Factory, and Azure Databricks; developed a POC for moving data from flat files and SQL Server using U-SQL jobs.
Expertise in AWS Big Data services such as EC2, S3, Auto Scaling, Glue, Lambda, CloudWatch, CloudFormation, Athena, DynamoDB, and Redshift.
Able to work effectively in cross-functional team environments with excellent communication and interpersonal skills; involved in transforming Hive/SQL queries into Spark transformations using Spark DataFrames and Scala.
Experienced in using Kafka and Kafka brokers to initiate Spark context and process live streaming data.
Worked with various programming languages using IDEs such as Eclipse, NetBeans, and IntelliJ, as well as tools like PuTTY and Git.
Adaptable to various operating systems, including Unix/Linux (CentOS, Red Hat, Ubuntu) and Windows environments.
Experience in data cleansing, profiling, and analysis, including UNIX shell scripting, SQL, and PL/SQL coding.
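A minimal sketch of the custom Kafka producer and consumer work listed in the summary above (illustrative only: the kafka-python client, broker address, and topic name are assumptions rather than details from a specific engagement):

    import json

    from kafka import KafkaProducer, KafkaConsumer  # kafka-python client (assumed)

    TOPIC = "orders"  # placeholder topic name

    # Producer: publish JSON-encoded events to the topic.
    producer = KafkaProducer(
        bootstrap_servers="broker:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send(TOPIC, {"order_id": 123, "status": "CREATED"})
    producer.flush()

    # Consumer: read events back from the beginning of the topic.
    consumer = KafkaConsumer(
        TOPIC,
        bootstrap_servers="broker:9092",
        auto_offset_reset="earliest",
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    )
    for message in consumer:
        print(message.offset, message.value)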
Technical Skills:
Big Data Ecosystem: HDFS, YARN, MapReduce, Spark, Kafka, Kafka Connect, Hive, Airflow, StreamSets, Sqoop, HBase, Flume, Pig, Oozie
Hadoop Distributions: Hadoop, Cloudera CDP, Hortonworks HDP
Cloud Environment: Amazon AWS (EMR, EC2, EBS, RDS, S3, Athena, Lambda, SQS, DynamoDB, Redshift), Azure (Databricks, Data Lake, Data Storage, Data Factory)
Scripting Languages: Python, Java, Scala, R, PowerShell, Pig Latin, HiveQL
NoSQL Databases: HBase, MongoDB
Databases: MySQL, Oracle, Teradata, MS SQL Server, PostgreSQL, DB2
ETL/BI: Snowflake, Informatica, Talend, SSIS, SSRS, SSAS, ER Studio, Tableau, Power BI
Operating Systems: Linux (Ubuntu, CentOS, RedHat), Windows (XP/7/8/10/11)
Others: Docker, Spring Boot, JIRA

Project Experience:
Client: Well Care, Tampa, FL    Aug 2023 - Present
Senior Data Engineer
Responsibilities:
Participated in analyzing business requirements and developed detailed specifications that adhered to project guidelines essential for project development.
Developed Spark code to process streaming data from the Kafka cluster and load it into the staging area for further processing (see the Spark Structured Streaming sketch after this entry).
Utilized Sqoop to export analyzed data to relational databases for visualization and report generation by the BI team using Tableau.
Developed and optimized data pipelines using Apache Spark on Databricks to enhance data processing speed and efficiency.
Worked with GCP Dataproc, GCS, Cloud Functions, and BigQuery.
Extensively employed Apache Kafka, Apache Spark, HDFS, and Apache Impala to construct near real-time data pipelines that collect, transform, store, and analyze clickstream data, enhancing personalized user experiences.
Primarily focused on data migration using SQL, SQL Azure, Azure Storage, Azure Data Factory, SSIS, and PowerShell.
Designed and implemented a configurable data delivery pipeline for scheduled updates to customer-facing data stores using Python.
Utilized Databricks notebooks for collaborative data exploration and analysis, facilitating effective communication with data science and analytics teams.
Skilled in ETL concepts, building ETL solutions, and data modeling; architected ETL transformation layers and wrote Spark jobs for data processing.
Compiled daily sales team updates for reporting to executives and managed jobs running on Spark clusters.
Optimized TensorFlow models for improved efficiency.
Utilized PySpark for DataFrames, ETL, data mapping, transformation, and loading in complex, high-volume environments.
Implemented Apache Airflow for authoring, scheduling, and monitoring data pipelines.
Proficient in machine learning techniques (decision trees, linear/logistic regression) and statistical modeling.
Developed business applications and data marts for reporting, engaging in various phases of the development life cycle, including analysis, design, coding, unit testing, integration testing, review, and release per business requirements.
Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators.
Created efficient and scalable ETL processes to load, cleanse, and validate data.
Implemented business use cases in Hadoop/Hive and visualized results in Tableau.
Managed data processing from Kafka topics to display real-time streaming on dashboards.
Extracted, transformed, and loaded data from source systems to Azure Data Storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL with Azure Data Lake Analytics.
Moved data between GCP and Azure using Azure Data Factory.
Conducted data ingestion into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed data in Azure Databricks.
Developed ETL workflows for data processed in HDFS and HBase using Scala and Oozie.
Aggregated data from various sources to conduct complex analyses for actionable insights.
Monitored the efficiency of the Hadoop/Hive environment to ensure SLA compliance.
Implemented copy activities and custom Azure Data Factory pipeline activities.
Created ad-hoc analysis solutions using Azure Data Lake Analytics/Store and HDInsight.
Analyzed systems for potential enhancements and conducted impact analysis for implementing ETL changes.
Used the Cloud Shell SDK in GCP to configure services such as Dataproc, Cloud Storage, and BigQuery.
Designed several Directed Acyclic Graphs (DAGs) to automate ETL pipelines.
Migrated on-premise data (Oracle/SQL Server/DB2/MongoDB) to Azure Data Lake Storage (ADLS) using Azure Data Factory (ADF V1/V2).
Wrote UNIX shell scripts for job automation and scheduled cron jobs using crontab.
Collaborated with team members and stakeholders in designing and developing the data environment.
Prepared associated documentation for specifications, requirements, and testing.
Environment: Azure, Databricks, HDInsight, Data Factory, Data Lake, Kafka, Impala, PySpark, Apache Beam, Cloud Shell, Tableau, Cloud SQL, MySQL, PostgreSQL, SQL Server, Python, Scala, Spark, Hive, Spark SQL, NoSQL, MongoDB, TensorFlow, Jira.
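Illustrative sketch of the Kafka-to-staging Spark job referenced in this entry (a minimal example, assuming the spark-sql-kafka connector is available; the broker address, topic, event schema, and staging paths are placeholders):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    spark = SparkSession.builder.appName("clickstream-staging").getOrCreate()

    # Placeholder schema for the JSON events on the topic.
    schema = StructType([
        StructField("user_id", StringType()),
        StructField("page", StringType()),
        StructField("event_ts", TimestampType()),
    ])

    raw = (spark.readStream.format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")
           .option("subscribe", "clickstream")
           .option("startingOffsets", "latest")
           .load())

    events = (raw.select(from_json(col("value").cast("string"), schema).alias("e"))
              .select("e.*"))

    # Land parsed events in the staging area for further processing.
    (events.writeStream.format("parquet")
     .option("path", "/staging/clickstream")
     .option("checkpointLocation", "/checkpoints/clickstream")
     .outputMode("append")
     .start()
     .awaitTermination())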
Client: FINRA, Rockville, MD    Apr 2021 - July 2023
Senior Data Engineer
Responsibilities:
Developed highly complex, maintainable, and easy-to-use Python and Scala code that satisfies application requirements for data processing and analytics using built-in libraries.
Worked with Docker container snapshots, attaching to running containers, removing images, managing directory structures, and managing containers.
Conducted performance tuning and optimization of Spark jobs in Databricks to minimize execution time and resource utilization.
Used Apache NiFi to copy data from the local file system to HDP.
Designed both 3NF data models and dimensional data models using star and snowflake schemas.
Handled message streaming data through Kafka to S3.
Implemented a Python script for creating the AWS CloudFormation template to build EMR clusters with the required instance types.
Developed data visualization dashboards using Databricks SQL Analytics to provide stakeholders with actionable insights from large datasets.
Loaded files to Hive and HDFS from Oracle and SQL Server using Sqoop.
Collected data using Spark Streaming from an AWS S3 bucket in near real time, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.
Created monitors, alarms, notifications, and logs for Lambda functions, Glue jobs, and EC2 hosts using CloudWatch.
Worked on AWS EMR clusters for processing Big Data across a Hadoop cluster of virtual servers.
Developed various mappings with the collection of all sources, targets, and transformations using Informatica Designer.
Developed a Python script to extract data from on-premises systems via REST APIs and transfer it to AWS S3 (see the sketch after this entry).
Implemented a microservices-based cloud architecture using Spring Boot.
Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark context, Spark SQL, PostgreSQL, DataFrames, OpenShift, Talend, and pair RDDs.
Experience deploying Hadoop in VMs, on AWS Cloud, and in physical server environments; monitored Hadoop cluster connectivity, security, and file system management.
Performed data engineering using Spark, Python, and PySpark.
Developed mappings using transformations such as Expression, Filter, Joiner, and Lookup for better data massaging and to migrate clean, consistent data.
Replaced and appended data in the Hive database by pulling it with Sqoop into HDFS from multiple data marts as sources.
Worked with data engineers and data architects to define back-end requirements for data products (aggregations, materialized views, tables, visualization).
Expertise in using Docker to run and deploy applications in multiple containers with Docker Swarm and Docker Wave.
Maintained data governance and security protocols in Databricks, ensuring compliance with organizational policies and industry standards.
Utilized Spark, Scala, Hadoop, HBase, Cassandra, MongoDB, Kafka, Spark Streaming, MLlib, and Python, and used the engine to increase user lifetime by 45% and triple user conversions for target categories.
Involved in unit testing the code and provided feedback to the developers; performed unit testing of the application using NUnit.
Created an architecture stack blueprint for data access with the NoSQL database Cassandra; brought data from various sources into Hadoop and Cassandra using Kafka.
Created multiple dashboards in Tableau for multiple business needs.
Installed and configured Hive, wrote Hive UDFs, and used Piggybank, a repository of UDFs, for Pig Latin.
Successfully implemented a POC (Proof of Concept) in development databases to validate the requirements and benchmark the ETL loads.
Supported continuous storage in AWS using Elastic Block Storage, S3, and Glacier; created volumes and configured snapshots for EC2 instances.
Worked on ETL migration services by developing and deploying AWS Lambda functions to build a serverless data pipeline that writes to the Glue Catalog and can be queried from Athena.
Managed security groups on AWS, focusing on high availability, fault tolerance, and auto scaling using Terraform templates, along with continuous integration and continuous deployment with AWS Lambda and AWS CodePipeline.
Used Pandas in Python for data cleansing and validating the source data.
Deployed the Big Data Hadoop application using Talend on AWS (Amazon Web Services) as well as on Microsoft Azure.
After transformation, moved the transformed data to a Spark cluster, where it goes live to the application using Spark Streaming and Kafka.
Environment: Hortonworks, Hadoop, HDFS, Databricks, AWS Glue, AWS Athena, EMR, Pig, Sqoop, Hive, NoSQL, HBase, Shell Scripting, Scala, Spark, Spark SQL, AWS, SQL Server, Tableau, Kafka
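A minimal sketch of the kind of on-premises-to-S3 transfer script described in this entry (illustrative only; the endpoint URL, bucket, and key are placeholders, and the requests and boto3 libraries are assumed):

    import json

    import boto3
    import requests


    def extract_to_s3(api_url, bucket, key, token=None):
        """Pull records from an on-premises REST endpoint and land them in S3 as JSON lines."""
        headers = {"Authorization": f"Bearer {token}"} if token else {}
        response = requests.get(api_url, headers=headers, timeout=60)
        response.raise_for_status()

        body = "\n".join(json.dumps(record) for record in response.json()).encode("utf-8")
        boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=body)


    if __name__ == "__main__":
        # Placeholder endpoint, bucket, and key names.
        extract_to_s3("https://onprem.example.com/api/records",
                      "landing-bucket", "records/2022/01/01.jsonl")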
Client: JP Morgan Chase, Jersey City, NJ    July 2019 - Mar 2021
Data Engineer
Responsibilities:
Involved in forward engineering of the logical models to generate the physical model using Erwin, and generated data models in Erwin with subsequent deployment to the Enterprise Data Warehouse.
Wrote various data normalization jobs for new data ingested into Redshift (see the sketch after this entry).
Defined facts and dimensions and designed the data marts using Ralph Kimball's dimensional data mart modeling methodology with Erwin.
Worked on implementing a robust data lake architecture in Databricks, ensuring seamless data integration and accessibility across teams.
Developed SSRS reports and SSIS packages to extract, transform, and load data from various source systems.
Developed a data pipeline using Kafka to store data in HDFS.
Worked on Big Data on AWS cloud services, i.e., EC2, S3, EMR, and DynamoDB.
Created Entity Relationship Diagrams (ERD), functional diagrams, and data flow diagrams, enforced referential integrity constraints, and created logical and physical models using Erwin.
Created ad-hoc queries and reports to support business decisions using SQL Server Reporting Services (SSRS).
Strong understanding of AWS components such as EC2 and S3; used Hive SQL, Presto SQL, and Spark SQL for ETL jobs, choosing the right technology to get the job done.
Managed security groups on AWS, focusing on high availability, fault tolerance, and auto scaling using Terraform templates, along with continuous integration and continuous deployment with AWS Lambda and AWS CodePipeline.
Measured the efficiency of the Hadoop/Hive environment to ensure SLAs were met.
Optimized the TensorFlow model for efficiency.
Analyzed the system for new enhancements/functionalities and performed impact analysis of the application for implementing ETL changes.
Created various complex SSIS/ETL packages to extract, transform, and load data; was responsible for ETL and data validation using SQL Server Integration Services.
Defined and deployed monitoring, metrics, and logging systems on AWS.
Connected to Amazon Redshift through Tableau to extract live data for real-time analysis.
Implemented Workload Management (WLM) in Redshift to prioritize basic dashboard queries over more complex, longer-running ad-hoc queries, allowing for a more reliable and faster reporting interface with sub-second query response for basic queries.
Published interactive data visualization dashboards, reports, and workbooks on Tableau and SAS Visual Analytics.
Implemented and managed ETL solutions and automated operational processes.
Optimized and tuned the Redshift environment, enabling queries to perform up to 100x faster for Tableau and SAS Visual Analytics.
Collaborated with team members and stakeholders in the design and development of the data environment.
Prepared associated documentation for specifications, requirements, and testing.
Environment: AWS, EC2, S3, DynamoDB, Redshift, Kafka, SQL Server, Erwin, Oracle, Informatica, RDS, NoSQL, Snowflake Schema, MySQL, PostgreSQL, Tableau, GitHub.
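An illustrative sketch of a data normalization job of the type described in this entry, written in PySpark for clarity (table names, columns, and S3 paths are placeholders; loading into Redshift would typically follow via a COPY from the curated output):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("normalize-trades").getOrCreate()

    # Hypothetical raw feed landed in S3 by upstream ingestion.
    raw = spark.read.json("s3://raw-bucket/trades/dt=2021-01-04/")

    normalized = (raw
        .withColumn("trade_ts", F.to_timestamp("trade_ts"))
        .withColumn("trade_date", F.to_date("trade_ts"))
        .withColumn("symbol", F.upper(F.trim(F.col("symbol"))))
        .dropDuplicates(["trade_id"])
        .filter(F.col("quantity") > 0))

    # Write curated Parquet; a Redshift COPY (or Spectrum external table) loads it downstream.
    (normalized.write.mode("overwrite")
        .partitionBy("trade_date")
        .parquet("s3://curated-bucket/trades/"))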
Client: Volkswagen Group of America - Auburn Hills, MI    Dec 2017 - Jun 2019
Data Engineer
Responsibilities:
Created HBase tables to load large sets of structured data.
Used Pig as an ETL tool to perform transformations, event joins, and pre-aggregations before storing the data on HDFS (an equivalent PySpark sketch follows this entry).
Created pipelines in ADF using Linked Services/Datasets/Pipelines to extract, transform, and load data from different sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, including write-back.
Developed a data pipeline using Flume, Sqoop, and Pig to extract data from weblogs and store it in HDFS.
Used Sqoop to import and export data between HDFS and RDBMS.
Exported the analyzed data to the relational database MySQL using Sqoop for visualization and report generation.
Developed UDF, UDAF, and UDTF functions and implemented them in Hive queries.
Implemented Sqoop for large dataset transfers between Hadoop and RDBMSs.
Processed data into HDFS by developing custom solutions.
Created MapReduce jobs to convert periodic XML messages into partitioned Avro data.
Assisted in creating and maintaining technical documentation for launching Hadoop clusters and for executing Hive queries and Pig scripts.
Managed and reviewed Hadoop log files.
Used Sqoop extensively to import data from various systems/sources (such as MySQL) into HDFS.
Created components such as Hive UDFs to supply functionality missing from Hive for analytics.
Developed scripts and batch jobs to schedule an Oozie bundle (a group of coordinators).
Used different file formats, including text files, Sequence Files, and Avro.
Provided cluster coordination services through Zookeeper.
Worked extensively with Hive DDLs and Hive Query Language (HQL).
Analyzed data using MapReduce, Pig, and Hive and produced summary results from Hadoop for downstream systems.
Environment: Hadoop, HDFS, MapReduce, Azure, Hive, Pig, Sqoop, HBase, Shell Scripting, Oozie
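The Pig-style event join and pre-aggregation mentioned in this entry, sketched as an equivalent PySpark job for illustration (the log layout, field positions, and HDFS paths are assumptions):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("weblog-preagg").getOrCreate()

    # Raw tab-delimited web logs on HDFS; assumed layout: timestamp, user_id, url.
    logs = spark.read.text("hdfs:///data/weblogs/")
    fields = F.split(F.col("value"), "\t")
    events = logs.select(
        fields.getItem(0).alias("ts"),
        fields.getItem(1).alias("user_id"),
        fields.getItem(2).alias("url"),
    )

    # User dimension previously imported with Sqoop (assumed columns: user_id, region).
    users = spark.read.parquet("hdfs:///data/users/")

    # Event join plus daily pre-aggregation before landing back on HDFS.
    daily = (events.join(users, "user_id", "left")
             .groupBy(F.to_date("ts").alias("day"), "region")
             .agg(F.count("*").alias("hits"),
                  F.countDistinct("user_id").alias("unique_visitors")))

    daily.write.mode("overwrite").parquet("hdfs:///curated/weblog_daily/")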
Client: Magene Life Sciences, India    May 2014 - Sep 2017
Data Engineer
Responsibilities:
Designed a solution to execute ETL tasks, including data acquisition, transformation, cleaning, and efficient storage on HDFS.
Developed Spark code in Scala and utilized Spark Streaming for faster testing and processing of data; stored the processed data back into the Hadoop Distributed File System.
Implemented machine learning algorithms (K-Nearest Neighbors, Random Forest) using Spark MLlib on HDFS data, comparing the accuracy of the models (see the sketch after this entry).
Leveraged Tableau to visualize the outcomes of the machine learning algorithms.
Utilized Sqoop to transfer data from MySQL Server to the Hadoop clusters.
Extensively worked with PySpark and Spark SQL for data cleansing and the generation of DataFrames and RDDs.
Participated in creating Hive tables, loading data into them, and writing Hive queries to access data in HDFS.
Optimized performance of Pig queries and developed Pig scripts for data processing.
Crafted Hive queries to transform data into a tabular format and processed the results using Hive Query Language.
Used Apache Flume to load real-time unstructured data, such as XML files and logs, into HDFS.
Processed large volumes of both structured and unstructured data using the MapReduce framework.
Environment: Apache Sqoop, Apache Flume, Hadoop, MapReduce, Spark, Hive, Pig, Spark MLlib, Tableau
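A minimal sketch of the Spark MLlib model training and accuracy evaluation described in this entry, shown for the random forest only and using the DataFrame-based spark.ml API for illustration (the input path, feature columns, and label column are placeholders):

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, VectorAssembler
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    spark = SparkSession.builder.appName("rf-accuracy").getOrCreate()

    # Hypothetical feature table already prepared on HDFS.
    df = spark.read.parquet("hdfs:///data/model_input/")
    train, test = df.randomSplit([0.8, 0.2], seed=42)

    pipeline = Pipeline(stages=[
        StringIndexer(inputCol="outcome", outputCol="label"),                  # assumed label column
        VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features"),   # assumed feature columns
        RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=100),
    ])

    model = pipeline.fit(train)
    predictions = model.transform(test)

    accuracy = MulticlassClassificationEvaluator(
        labelCol="label", predictionCol="prediction", metricName="accuracy"
    ).evaluate(predictions)
    print(f"Random forest test accuracy: {accuracy:.3f}")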