Candidate's Name
Email: EMAIL AVAILABLE
Contact: PHONE NUMBER AVAILABLE
LinkedIn: LINKEDIN LINK AVAILABLE

Senior Data Engineer

Professional Summary:
8+ years of IT experience in analysis, design, development, implementation, maintenance, and support, with experience in developing strategic methods for deploying big data technologies to efficiently solve Big Data processing requirements.
Expertise in using Hadoop ecosystem components such as MapReduce, Pig, Hive, ZooKeeper, HBase, Sqoop, Oozie, Flume, Drill and Spark for data storage and analysis.
Good experience with the Oozie framework and automating daily import jobs.
Experienced in troubleshooting errors in HBase Shell/API, Pig, Hive and MapReduce.
Highly experienced in importing and exporting data between HDFS and relational database management systems using Sqoop.
Experience working with Cloudera, Amazon Web Services (AWS), Microsoft Azure and Hortonworks.
Experienced in managing Hadoop clusters and services using Cloudera Manager.
Good understanding of Spark architecture with Databricks and Structured Streaming; set up Databricks on AWS and Microsoft Azure.
Designed and implemented a product search service using Apache Solr.
Good understanding of Azure big data technologies such as Azure Data Lake Analytics, Azure Data Lake Store, Azure Data Factory and Azure Databricks; created a POC for moving data from flat files and SQL Server using U-SQL jobs.
Implemented various algorithms for analytics using Cassandra with Spark and Scala.
Experienced in creating Vizboards for data visualization in Platfora for real-time dashboards on Hadoop.
Collected log data from various sources and integrated it into HDFS using Flume.
Good understanding of NoSQL databases and hands-on work experience in writing applications on NoSQL databases such as Cassandra and MongoDB.
Experience in developing custom UDFs for Pig and Hive to incorporate methods and functionality of Python/Java into Pig Latin and HiveQL, and used UDFs from the Piggybank UDF repository.
Experienced in running queries using Impala and used BI tools to run ad-hoc queries directly on Hadoop.
Expertise with big data on AWS cloud services, i.e. EC2, S3, Auto Scaling, Glue, Lambda, CloudWatch, CloudFormation, Athena, DynamoDB and Redshift.
Created machine learning models with the help of Python and scikit-learn.
Creative skills in developing elegant solutions to challenges related to pipeline engineering.
Strong experience in core Java, Scala, SQL, PL/SQL and RESTful web services.
Extensive knowledge of various reporting objects in Tableau such as facts, attributes, hierarchies, transformations, filters, prompts, calculated fields, sets, groups and parameters.
Experience in working with Flume and NiFi for loading log files into Hadoop.
Good knowledge in querying data from Cassandra for searching, grouping and sorting.
Developed custom Kafka producers and consumers for publishing and subscribing to different Kafka topics (see the sketch at the end of this summary).
Good working experience with Spark (Spark Streaming, Spark SQL), Scala and Kafka.
Worked on reading multiple data formats on HDFS using Scala.
Ability to work effectively in cross-functional team environments, with excellent communication and interpersonal skills.
Involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames and Scala.
Experience in using Kafka and Kafka brokers to initiate the Spark context and process live streaming data.
Worked with various programming languages using IDEs and tools such as Eclipse, NetBeans, IntelliJ, PuTTY and Git.
Excellent experience in designing and developing enterprise applications for the J2EE platform using Servlets, JSP, Struts, Spring, Hibernate and web services.
Flexible working with operating systems such as Unix/Linux (CentOS, Red Hat, Ubuntu) and Windows environments.
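The following is a minimal, illustrative-only sketch of the custom Kafka producer/consumer pattern mentioned above, using the kafka-python client; the broker address and topic name are hypothetical placeholders rather than details from any of the projects below.

# Minimal sketch of a custom Kafka producer/consumer pair using the
# kafka-python client. The broker address and topic name are placeholders.
import json
from kafka import KafkaProducer, KafkaConsumer

BROKERS = ["localhost:9092"]   # assumption: local broker for illustration
TOPIC = "clickstream-events"   # hypothetical topic name

# Producer: serialize dicts as JSON and publish to the topic.
producer = KafkaProducer(
    bootstrap_servers=BROKERS,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send(TOPIC, {"user_id": 42, "action": "page_view"})
producer.flush()

# Consumer: subscribe to the same topic and deserialize messages.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKERS,
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    consumer_timeout_ms=5000,  # stop iterating if no new messages arrive
)
for message in consumer:
    print(message.value)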
Technical Skills:
Hadoop Eco System: Hadoop, MapReduce, Spark, HDFS, Sqoop, YARN, Oozie, Hive, Impala, Apache Flume, Apache Storm, Apache Airflow, HBase
Programming Languages: Java, PL/SQL, SQL, Python, Scala, PySpark, C, C++
Cluster Mgmt & Monitoring: CDH 4, CDH 5, Hortonworks Ambari 2.5
Databases: MySQL, SQL Server, Oracle, MS Access
NoSQL Databases: MongoDB, Cassandra, HBase
Workflow Mgmt Tools: Oozie, Apache Airflow
Visualization & ETL Tools: Tableau, Informatica, Talend
Cloud Technologies: Azure, AWS
IDEs: Eclipse, Jupyter Notebook, Spyder, PyCharm, IntelliJ
Version Control Systems: Git, SVN
Operating Systems: Unix, Linux, Windows
Project Experience:
Client: The Bridge Corp, Philadelphia, PA    Nov 2023 - Present
Senior Data Engineer
Involved in analyzing business requirements and prepared detailed specifications that follow project guidelines required for project development.
Experienced in ETL concepts, building ETL solutions and data modeling.
Worked on architecting the ETL transformation layers and writing Spark jobs to do the processing.
Aggregated daily sales team updates to send reports to executives and to organize jobs running on Spark clusters.
Optimized the TensorFlow model for efficiency.
Used PySpark for data frames, ETL, data mapping, transformation and loading in a complex, high-volume environment.
Compiled data from various sources to perform complex analysis for actionable results.
Measured efficiency of the Hadoop/Hive environment, ensuring SLAs are met.
Implemented Copy activity and custom Azure Data Factory pipeline activities.
Implemented ad-hoc analysis solutions using Azure Data Lake Analytics/Store and HDInsight.
Analyzed the system for new enhancements/functionalities and performed impact analysis of the application for implementing ETL changes.
Implemented Apache Airflow for authoring, scheduling and monitoring data pipelines.
Created Spark code to process streaming data from the Kafka cluster and load the data to a staging area for processing (see the sketch at the end of this section).
Exported the analyzed data to relational databases using Sqoop for visualization and to generate reports for the BI team using Tableau.
Extensively used Apache Kafka, Apache Spark, HDFS and Apache Impala to build near real-time data pipelines that get, transform, store and analyze clickstream data to provide a better personalized user experience.
Primarily involved in data migration using SQL, SQL Azure, Azure Storage, Azure Data Factory, SSIS and PowerShell.
Designed and implemented a configurable data delivery pipeline, built with Python, for scheduled updates to customer-facing data stores.
Proficient in machine learning techniques (decision trees, linear/logistic regression) and statistical modeling.
Implemented medium to large scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB).
Performed data extraction, transformation, loading and integration in the data warehouse, operational data stores, master data management business applications and data marts for reporting.
Involved in different phases of the development life cycle including analysis, design, coding, unit testing, integration testing, review and release as per the business requirements.
Built performant, scalable ETL processes to load, cleanse and validate data.
Implemented business use cases in Hadoop/Hive and visualized them in Tableau.
Created data pipelines used for business reports and processed streaming data using Kafka on an on-premise cluster.
Processed the data from Kafka topics and showed the real-time streaming in dashboards.
Extracted, transformed and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL and U-SQL (Azure Data Lake Analytics).
Performed data ingestion to one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
Worked on developing ETL workflows on the obtained data using Scala, processing it in HDFS and HBase using Oozie.
Designed several DAGs (Directed Acyclic Graphs) for automating ETL pipelines.
Migrated on-premise data (Oracle/SQL Server/DB2/MongoDB) to Azure Data Lake Store (ADLS) using Azure Data Factory (ADF V1/V2).
Wrote UNIX shell scripts to automate jobs and scheduled cron jobs for job automation using crontab.
Collaborated with team members and stakeholders in the design and development of the data environment.
Prepared associated documentation for specifications, requirements and testing.
Environment: Kafka, Impala, PySpark, Azure, HDInsight, Data Factory, Databricks, Data Lake, Apache Beam, Cloud Shell, Tableau, Cloud SQL, MySQL, Postgres, SQL Server, Python, Scala, Spark, Hive, Spark SQL, NoSQL, MongoDB, TensorFlow, Jira.
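A simplified sketch of the Kafka-to-staging streaming step described above, using PySpark Structured Streaming; the broker, topic, schema and staging paths are hypothetical placeholders, not the project's actual configuration.

# Illustrative-only sketch: read click-stream events from Kafka with
# Structured Streaming and land them in a staging path as Parquet.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("kafka-to-staging").getOrCreate()

# Assumed JSON payload schema for the incoming click-stream events.
schema = StructType([
    StructField("user_id", StringType()),
    StructField("action", StringType()),
    StructField("event_time", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder broker
       .option("subscribe", "clickstream")                  # placeholder topic
       .option("startingOffsets", "latest")
       .load())

# Kafka delivers the payload as bytes in the `value` column; parse it as JSON.
events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), schema).alias("e"))
          .select("e.*"))

# Write micro-batches to a staging location (paths are placeholders).
query = (events.writeStream
         .format("parquet")
         .option("path", "/staging/clickstream")
         .option("checkpointLocation", "/staging/_checkpoints/clickstream")
         .outputMode("append")
         .start())

query.awaitTermination()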
Client: Ace Hardware, Oak Brook, IL    Jan 2023 - Oct 2023
Big Data Engineer
Collected data using Spark Streaming from an AWS S3 bucket in near real time and performed the necessary transformations and aggregations on the fly to build the common learner data model, persisting the data in HDFS.
Used Apache NiFi to copy data from the local file system to HDP.
Designed both 3NF data models and dimensional data models using star and snowflake schemas.
Handled streaming message data through Kafka to S3.
Implemented a Python script for creating the AWS CloudFormation template to build an EMR cluster with the required instance types.
Successfully loaded files to Hive and HDFS from Oracle and SQL Server using Sqoop.
Developed highly complex Python and Scala code which is maintainable, easy to use and satisfies application requirements, performing data processing and analytics using built-in libraries.
Explored Spark to improve the performance and optimization of the existing algorithms in Hadoop using Spark context, Spark SQL, PostgreSQL, Data Frames, OpenShift, Talend and pair RDDs.
Experience deploying Hadoop in VM and AWS cloud as well as physical server environments; monitored Hadoop cluster connectivity, security and file system management.
Performed data engineering using Spark, Python and PySpark.
Developed mappings using transformations like Expression, Filter, Joiner and Lookup for better data massaging and to migrate clean and consistent data.
Replaced and appended data in the Hive database by pulling it with Sqoop into HDFS from multiple data marts as sources.
Worked with data engineers and data architects to define back-end requirements for data products (aggregations, materialized views, tables, visualization).
Worked on ingesting data through cleansing and transformations, leveraging AWS Lambda, AWS Glue and Step Functions (see the Glue job sketch at the end of this section).
Created monitors, alarms, notifications and logs for Lambda functions, Glue jobs and EC2 hosts using CloudWatch.
Worked on AWS EMR clusters for processing big data across a Hadoop cluster of virtual servers.
Developed various mappings with the collection of all sources, targets and transformations using Informatica Designer.
Developed a Python script to transfer data and extract data from on-premises systems and REST APIs to AWS S3.
Implemented a microservices-based cloud architecture using Spring Boot.
Worked on Docker container snapshots, attaching to a running container, removing images, managing directory structures and managing containers.
Expertise in using Docker to run and deploy applications in multiple containers with Docker Swarm and Docker Wave.
Utilized Spark, Scala, Hadoop, HBase, Cassandra, MongoDB, Kafka, Spark Streaming, MLlib and Python, and used the engine to increase user lifetime by 45% and triple user conversions for target categories.
Involved in unit testing the code and provided feedback to the developers; performed unit testing of the application using NUnit.
Created an architecture stack blueprint for data access with the NoSQL database Cassandra; brought data from various sources into Hadoop and Cassandra using Kafka.
Created multiple dashboards in Tableau for multiple business needs.
Installed and configured Hive, wrote Hive UDFs and used Piggybank, a repository of UDFs for Pig Latin.
Successfully implemented a POC (Proof of Concept) in development databases to validate the requirements and benchmark the ETL loads.
Supported continuous storage in AWS using Elastic Block Store, S3 and Glacier; created volumes and configured snapshots for EC2 instances.
Worked on ETL migration services by developing and deploying AWS Lambda functions to generate a serverless data pipeline that can be written to the Glue Catalog and queried from Athena.
Managed security groups on AWS, focusing on high availability, fault tolerance and auto scaling using Terraform templates, along with continuous integration and continuous deployment with AWS Lambda and AWS CodePipeline.
Used Pandas in Python for data cleansing and validating the source data.
Deployed the big data Hadoop application using Talend on AWS (Amazon Web Services) cloud and also on Microsoft Azure.
After the data transformation is done, the transformed data is moved to a Spark cluster where it is set to go live on the application using Spark Streaming and Kafka.
Environment: Hortonworks, Hadoop, HDFS, AWS Glue, AWS Athena, EMR, Pig, Sqoop, Hive, NoSQL, HBase, Shell Scripting, Scala, Spark, Spark SQL, AWS, SQL Server, Tableau, Kafka
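A minimal AWS Glue (PySpark) job skeleton illustrating the cleanse-and-transform ingestion step mentioned above; the catalog database, table names and S3 path are hypothetical placeholders under assumed column names.

# Minimal Glue job sketch: read from the Glue Data Catalog, cleanse, and
# write curated Parquet to S3. Names and paths are placeholders.
import sys
from awsglue.transforms import ApplyMapping, DropNullFields
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw source table registered in the Glue Data Catalog.
raw = glue_context.create_dynamic_frame.from_catalog(
    database="raw_db",          # placeholder catalog database
    table_name="orders_raw",    # placeholder table
)

# Rename/cast columns and drop null-only fields as a simple cleanse step.
mapped = ApplyMapping.apply(
    frame=raw,
    mappings=[
        ("order_id", "string", "order_id", "string"),
        ("amount", "string", "amount", "double"),
        ("order_ts", "string", "order_ts", "timestamp"),
    ],
)
cleaned = DropNullFields.apply(frame=mapped)

# Write curated data back to S3 as Parquet for downstream Athena queries.
glue_context.write_dynamic_frame.from_options(
    frame=cleaned,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/curated/orders/"},
    format="parquet",
)

job.commit()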
Client: Broadway Bank, San Antonio, TX    Aug 2021 - Dec 2022
Data Engineer
Managed security groups on AWS, focusing on high availability, fault tolerance and auto scaling using Terraform templates, along with continuous integration and continuous deployment with AWS Lambda and AWS CodePipeline.
Involved in the forward engineering of the logical models to generate the physical model and data models using Erwin, with subsequent deployment to the Enterprise Data Warehouse.
Wrote various data normalization jobs for new data ingested into Redshift (see the load sketch at the end of this section).
Defined facts and dimensions and designed the data marts using Ralph Kimball's dimensional data mart modeling methodology in Erwin.
Created various complex SSIS/ETL packages to extract, transform and load data; was responsible for ETL and data validation using SQL Server Integration Services.
Defined and deployed monitoring, metrics and logging systems on AWS.
Connected to Amazon Redshift through Tableau to extract live data for real-time analysis.
Implemented Workload Management (WLM) in Redshift to prioritize basic dashboard queries over more complex, longer-running ad-hoc queries, which allowed a more reliable and faster reporting interface, giving sub-second query response for basic queries.
Worked on publishing interactive data visualization dashboards, reports and workbooks in Tableau and SAS Visual Analytics.
Implemented and managed ETL solutions and automated operational processes.
Optimized and tuned the Redshift environment, enabling queries to perform up to 100x faster for Tableau and SAS Visual Analytics.
Designed and built multi-terabyte, full end-to-end data warehouse infrastructure from the ground up on Confidential Redshift for large-scale data handling of millions of records every day.
Developed SSRS reports and SSIS packages to extract, transform and load data from various source systems.
Experience developing a data pipeline using Kafka to store data into HDFS.
Worked on big data AWS cloud services, i.e. EC2, S3, EMR and DynamoDB.
Created Entity Relationship Diagrams (ERD), functional diagrams and data flow diagrams, enforced referential integrity constraints, and created logical and physical models using Erwin.
Created ad-hoc queries and reports to support business decisions using SQL Server Reporting Services (SSRS).
Strong understanding of AWS components such as EC2 and S3.
Used Hive SQL, Presto SQL and Spark SQL for ETL jobs, choosing the right technology to get the job done.
Measured efficiency of the Hadoop/Hive environment, ensuring SLAs are met.
Optimized the TensorFlow model for efficiency.
Analyzed the system for new enhancements/functionalities and performed impact analysis of the application for implementing ETL changes.
Collaborated with team members and stakeholders in the design and development of the data environment.
Prepared associated documentation for specifications, requirements and testing.
Environment: AWS, EC2, S3, DynamoDB, Redshift, Kafka, SQL Server, Erwin, Oracle 10g/11g, Informatica, RDS, NoSQL, Snowflake Schema, MySQL, PostgreSQL, Tableau, GitHub.
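A simplified Python sketch of the kind of Redshift load-and-normalize job mentioned above, using psycopg2 with a COPY from S3; the cluster endpoint, credentials, IAM role, bucket and table names are hypothetical placeholders.

# Sketch: stage raw files from S3 into Redshift with COPY, then normalize
# into a target table. All identifiers below are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="etl_user",
    password="***",
)

copy_sql = """
    COPY staging.orders_raw
    FROM 's3://example-bucket/incoming/orders/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-copy-role'
    FORMAT AS PARQUET;
"""

normalize_sql = """
    INSERT INTO public.orders (order_id, customer_id, amount, order_date)
    SELECT DISTINCT order_id,
           customer_id,
           CAST(amount AS DECIMAL(12, 2)),
           CAST(order_ts AS DATE)
    FROM staging.orders_raw
    WHERE order_id IS NOT NULL;
"""

with conn, conn.cursor() as cur:
    cur.execute(copy_sql)       # load raw files from S3 into the staging table
    cur.execute(normalize_sql)  # de-duplicate, cast types, and load the target
    cur.execute("TRUNCATE staging.orders_raw;")
conn.close()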
Client: Hyundai Auto Ever America, Fountain Valley, CA    Mar 2019 - July 2021
Data Engineer
Developed a data pipeline using Flume, Sqoop and Pig to extract data from weblogs and store it in HDFS.
Used Sqoop to import and export data between HDFS and RDBMS.
Exported the analyzed data to the relational database MySQL using Sqoop for visualization and to generate reports.
Developed UDF, UDAF and UDTF functions and implemented them in Hive queries.
Implemented Sqoop for large dataset transfers between Hadoop and RDBMS.
Processed data into HDFS by developing solutions.
Created MapReduce jobs to convert periodic XML messages into partitioned Avro data.
Assisted in creating and maintaining technical documentation for launching Hadoop clusters and for executing Hive queries and Pig scripts.
Created HBase tables to load large sets of structured data.
Managed and reviewed Hadoop log files.
Used Sqoop widely to import data from various systems/sources (like MySQL) into HDFS.
Created components like Hive UDFs for missing functionality in Hive for analytics.
Developed scripts and batch jobs to schedule a bundle (a group of coordinators).
Used different file formats like text files, Sequence Files and Avro.
Provided cluster coordination services through ZooKeeper.
Worked extensively with Hive DDLs and Hive Query Language (HQL).
Analyzed the data using MapReduce, Pig and Hive and produced summary results from Hadoop to downstream systems.
Used Pig as an ETL tool to do transformations, event joins and some pre-aggregations before storing the data onto HDFS.
Environment: Hadoop, HDFS, MapReduce, Hive, Pig, Sqoop, HBase, Shell Scripting, Oozie, Oracle 11g.

Client: Metamor Software Solutions, Hyderabad, India    Mar 2016 - Nov 2018
Data Engineer
Loaded data from MySQL server to the Hadoop clusters using the data ingestion tool Sqoop.
Extensively worked with PySpark/Spark SQL for data cleansing and generating DataFrames and RDDs.
Involved in creating Hive tables, loading them with data and writing Hive queries on top of data present in HDFS.
Worked on tuning the performance of Pig queries and was involved in developing Pig scripts for processing data.
Wrote Hive queries to transform the data into tabular format and processed the results using Hive Query Language.
Loaded real-time unstructured data such as XML data and log files into HDFS using Apache Flume.
Processed large amounts of both structured and unstructured data using the MapReduce framework.
Designed solutions to perform ETL tasks like data acquisition, data transformation, data cleaning and efficient data storage on HDFS.
Developed Spark code using Scala and Spark Streaming for faster testing and processing of data, storing the resultant processed data back into the Hadoop Distributed File System.
Applied machine learning algorithms (k-nearest neighbors, random forest) using Spark MLlib on top of HDFS data and compared the accuracy between the models (see the sketch at the end of this section).
Used Tableau to visualize the outcomes of the ML algorithms.
Environment: Apache Sqoop, Apache Flume, Hadoop, MapReduce, Spark, Hive, Pig, Spark MLlib, Tableau
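A simplified PySpark sketch of the model-evaluation step from the last project: train a random forest on HDFS data with Spark MLlib and measure accuracy. The HDFS path and column names are hypothetical placeholders, and only the random forest model is shown (Spark MLlib has no built-in k-nearest neighbors classifier).

# Sketch: assemble features, fit a random forest, and report test accuracy.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.appName("rf-accuracy").getOrCreate()

# Load cleansed feature data from HDFS (placeholder path and columns).
df = spark.read.parquet("hdfs:///data/curated/features")

assembler = VectorAssembler(
    inputCols=["f1", "f2", "f3"],  # placeholder feature columns
    outputCol="features",
)
data = assembler.transform(df).select("features", "label")

train, test = data.randomSplit([0.8, 0.2], seed=42)

rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=100)
model = rf.fit(train)

predictions = model.transform(test)
evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy"
)
print("Random forest accuracy:", evaluator.evaluate(predictions))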