POOJA
SENIOR DATA ENGINEER
PH: PHONE NUMBER AVAILABLE | Email: EMAIL AVAILABLE | LinkedIn

PROFESSIONAL SUMMARY
- 10+ years of IT experience, including analysis, design, and development of Big Data solutions using Hadoop, web applications using Python, and data warehousing development using MySQL, Oracle, and Informatica.
- Solid experience with Big Data technologies, with hands-on experience installing, configuring, and using ecosystem components such as Hadoop MapReduce, HDFS, HBase, Zookeeper, Hive, Flume, Snowflake, Kafka, and Spark.
- Experience with Agile methodologies, Scrum stories, and sprints in a Python-based environment.
- Experience developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns (a brief sketch follows this summary).
- Well-versed with big data on AWS cloud services, including EC2, S3, EMR, Glue, Lambda, DynamoDB, and Redshift.
- Developed data ingestion modules using AWS Step Functions, AWS Glue, and Python modules.
- Prepared scripts to automate the ingestion process using Spark and Scala as needed, pulling from sources such as APIs, AWS S3, Teradata, and Redshift.
- Good understanding of Azure Big Data technologies such as Azure Data Lake Analytics, Azure Data Lake Store, Azure Data Factory, and Azure Synapse; created POCs for moving data from flat files and SQL Server using U-SQL jobs.
- Experience building orchestration on Azure Data Factory for scheduling purposes.
- Proficient in Azure Data Factory for performing incremental loads from Azure SQL DB to Azure Synapse.
- Responsible for estimating cluster size, monitoring, and troubleshooting Spark Databricks clusters.
- Developed Spark programs with Python and applied functional programming principles to process complex structured datasets.
- Expertise in writing RDD (Resilient Distributed Dataset) transformations and actions using Scala, Python, and Java.
- Hands-on experience handling database issues and connections with SQL and NoSQL databases such as MongoDB, Cassandra, Redis, CouchDB, and DynamoDB by installing and configuring various Python packages.
- Experience using data modeling techniques to derive results from SQL and PL/SQL queries.
- Worked with databases such as Oracle, SQL Server, and MySQL, writing stored procedures, functions, joins, and triggers for different data models.
- Experience setting up build and deployment automation for Terraform scripts using Jenkins.
- Proficient in SQLite, MySQL, and SQL databases with Python.
- Strong knowledge of Spark Streaming and Spark machine learning libraries.
- Implemented Apache Airflow for authoring, scheduling, and monitoring data pipelines.
- Experience using Cloudera Manager to install and manage single-node and multi-node Hadoop clusters (CDH4 & CDH5).
- Familiar with data architecture, including data ingestion pipeline design, Hadoop architecture, data modeling, data mining, and advanced data processing; experienced in optimizing ETL workflows.
- Good exposure to data quality, data mapping, and data filtration using data warehouse ETL tools such as Talend, Informatica, and DataStage.
- Worked with Text, SequenceFile, XML, Parquet, JSON, ORC, Avro, and clickstream file formats.
- Created Tableau reports with complex calculations and worked on ad-hoc reporting using Power BI.
- Designed and implemented various SharePoint features, content types, lists, user permissions, and solution packages.
- Designed and built dimensions and cubes with star schema using SQL Server Analysis Services (SSAS).
- Experience branching, tagging, and maintaining versions across environments using SCM tools such as Git, GitLab, GitHub, and Subversion (SVN) on Linux and Windows platforms.
- Hands-on experience with UNIX and shell scripting for automation.
- Experience with tracking tools such as JIRA and ServiceNow.
- Well-versed in building CI/CD pipelines with Jenkins, using tech stacks such as GitLab, Jenkins, Helm, and Kubernetes.
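The sketch below is a minimal, illustrative PySpark/Spark SQL example of the multi-format extraction and aggregation pattern referenced in the summary; the paths, schema, and column names are hypothetical placeholders, not the original project code.

```python
# Hypothetical sketch: read events from two file formats, combine them, and
# aggregate daily usage per customer with Spark SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("usage-aggregation").getOrCreate()

# Extract from multiple file formats (paths are placeholders)
events_json = spark.read.json("/mnt/raw/events_json/")
events_parquet = spark.read.parquet("/mnt/raw/events_parquet/")
events = events_json.unionByName(events_parquet, allowMissingColumns=True)

# Transform and aggregate with Spark SQL to surface usage patterns
events.createOrReplaceTempView("events")
usage = spark.sql("""
    SELECT customer_id,
           date_trunc('day', event_time)  AS usage_day,
           count(*)                       AS event_count,
           sum(duration_sec)              AS total_duration_sec
    FROM events
    GROUP BY customer_id, date_trunc('day', event_time)
""")

usage.write.mode("overwrite").parquet("/mnt/curated/customer_usage/")
```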
TECHNICAL SKILLS
Big Data Technologies: HDFS, Hive, MapReduce, Hadoop distributions, HBase, Spark, Kafka Streaming, Scala, ETL
Programming Languages: R, Scala, Terraform, Python, Java
Databases: MySQL, MS SQL Server 2012/16, Oracle 10g/11g/12c, Teradata
NoSQL: MongoDB, Cassandra, HBase
Scripting/Web Languages: HTML5, XML, SQL, Shell/Unix, Python
Operating Systems: Linux, Windows XP/7/8/10, Mac
Software Life Cycle: SDLC, Waterfall, and Agile models
Utilities/Tools: Eclipse, Tomcat, NetBeans, JUnit, SQL, SVN, Log4j, SOAP UI, ANT, Maven, Alteryx, Visio, Jenkins, Jira, IntelliJ, Bamboo
Data Visualization Tools: Tableau, SSRS, Cloud Health
Cloud Services: AWS (EC2, S3, EMR, RDS, Lambda, CloudWatch, Auto Scaling, Redshift, CloudFormation, Glue, etc.), Azure (Databricks, Azure Data Lake, Azure HDInsight)

PROFESSIONAL EXPERIENCE

EMC Insurance, Iowa May 2023 to Present
Senior Data Engineer
Responsibilities:
- Gathered business requirements and translated them into clear and concise specifications and queries.
- Developed Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
- Involved in building and architecting multiple data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation.
- Worked on Informatica PowerCenter tools: Designer, Repository Manager, Workflow Manager, and Workflow Monitor.
- Performed operations on AWS using EC2 instances and S3 storage, worked with RDDs and analytical Redshift operations, and wrote various data normalization jobs for new data ingested into Redshift by building a multi-terabyte data frame.
- Working knowledge of Spark RDD, DataFrame API, Dataset API, Data Source API, Spark SQL, and Spark Streaming.
- Developed Spark applications using Python and implemented Apache Spark data processing projects to handle data from various RDBMS and streaming sources.
- Worked with Spark to improve the performance and optimization of existing algorithms in Hadoop.
- Used Spark Streaming APIs to perform transformations and actions on the fly for building common data pipelines.
- Created complex SQL queries and scripts to extract and aggregate data and validate its accuracy.
- Wrote standard SQL queries to perform data validation and gathered analytical data to develop functional requirements using data modeling and ETL tools.
- Responsible for estimating the cluster size, monitoring, and troubleshooting the Spark Databricks cluster.
- Developed a Kafka consumer API in Python for consuming data from Kafka topics (see the sketch at the end of this section).
- Consumed Extensible Markup Language (XML) messages using Kafka and processed the XML files using Spark Streaming to capture User Interface (UI) updates.
- Developed a pre-processing job using Spark DataFrames to flatten JSON documents into flat files.
- Loaded DStream data into Spark RDDs and performed in-memory computation to generate output responses.
- Involved in writing real-time processing and core jobs using Spark Streaming with Kafka as a data pipeline system.
- Migrated an existing on-premises application to AWS, using AWS services such as EC2 and S3 for data processing and storage.
- Involved in maintaining the Hadoop cluster on AWS EMR.
- Loaded data into S3 buckets using AWS Glue and PySpark; filtered data stored in S3 buckets using Elasticsearch and loaded data into Hive external tables.
- Configured Snowpipe to pull data from S3 buckets into Snowflake tables and stored incoming data in Snowflake's staging area.
- Created numerous ODI interfaces and loaded them into Snowflake DB.
- Worked on Amazon Redshift to consolidate all data warehouses into one data warehouse.
- Used Hive to perform data analysis to meet business specification logic.
- Used Apache Kafka to aggregate web log data from multiple servers and make it available to downstream systems for data analysis.
- Developed custom UDFs in Python and used them for sorting and preparing data.
- Worked on custom loaders and storage classes in Pig to handle various data formats such as JSON, XML, and CSV, and generated bags for processing with Pig.
- Developed Sqoop and Kafka jobs to load data from RDBMS and external systems into HDFS and Hive.
- Developed Oozie coordinators to schedule Hive scripts and create data pipelines.
- Leveraged existing PL/SQL scripts for daily ETL operations.
- Developed ETL pipelines using Matillion for data integration and transformation tasks.
- Created Matillion jobs to orchestrate complex data workflows, integrating various data sources into a cohesive data warehouse environment.
- Optimized Matillion ETL processes to improve performance and reduce processing time, ensuring timely and accurate data availability.
- Implemented data transformations and data quality checks within Matillion to ensure data integrity and consistency across the data pipeline.
- Collaborated with data architects and business analysts to design and implement data models and ETL processes using Matillion.
- Built CI/CD pipelines with Jenkins, using tech stacks such as GitLab, Jenkins, Helm, and Kubernetes.
- Worked in an Agile (Scrum) development environment.
Environment: AWS, Spark, SQL, AWS EMR, S3, Snowflake, Hive, Pig, Apache Kafka, ETL, Informatica, Sqoop, Python, PySpark, Shell scripting, Linux, MySQL, Jenkins, Git, Oozie, Tableau, and Agile methodologies.
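A minimal sketch of the kind of Python Kafka consumer described in this role, written against the kafka-python client; the topic name, broker address, and message shape are hypothetical assumptions, not details from the original system.

```python
# Hypothetical Kafka consumer: read JSON messages from a topic and hand each
# record to downstream processing (e.g., Spark / S3 staging).
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "customer-events",                      # hypothetical topic name
    bootstrap_servers=["broker1:9092"],     # hypothetical broker
    group_id="ingestion-group",
    auto_offset_reset="earliest",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    record = message.value
    # Downstream handling would go here; print is only for illustration
    print(record.get("event_type"), message.partition, message.offset)
```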
Amway Corp, Ada, MI July 2021 to April 2023
Senior Data Engineer
Responsibilities:
- Proficient in working with Azure cloud platforms (HDInsight, Data Lake, Databricks, Blob Storage, Data Factory, Synapse, SQL, SQL DB, DWH, and Data Storage Explorer).
- Designed and deployed data pipelines using Data Lake, Databricks, and Apache Airflow.
- Worked on Azure Data Factory to integrate data from both on-prem (MySQL, Cassandra) and cloud (Blob Storage, Azure SQL DB) sources and applied transformations before loading back to Azure Synapse.
- Involved in writing Spark Scala functions for mining data to provide real-time insights and reports.
- Configured Spark Streaming to receive real-time data from Apache Flume and stored the stream data in Azure Table storage using Scala.
- Used Data Lake for storage and processing of all types of analytics data.
- Ingested data into Azure Blob Storage and processed it using Databricks.
- Involved in writing Spark Scala scripts and UDFs to perform transformations on large datasets.
- Loaded tables from Azure Data Lake to Azure Blob Storage to push them to Snowflake.
- Utilized the Spark Streaming API to stream data from various sources.
- Optimized existing Scala code and improved cluster performance.
- Used Spark DataFrames to create various datasets and applied business transformations and data cleansing operations in Databricks notebooks.
- Skilled in writing Python scripts to build ETL pipelines and Directed Acyclic Graph (DAG) workflows using Airflow and Apache NiFi (see the sketch at the end of this section).
- Responsible for migration of key systems from on-premises hosting to Azure Cloud Services.
- Involved in developing data ingestion pipelines on an Azure HDInsight Spark cluster using Azure Data Factory and Spark SQL; also worked with Cosmos DB (SQL API and Mongo API).
- Designed custom-built input adapters using Spark, Hive, and Sqoop to ingest and analyze data (Snowflake, MS SQL, MongoDB) into HDFS.
- Loaded data from web servers and Teradata using Sqoop, Flume, and the Spark Streaming API.
- Used a Flume sink to write directly to indexers deployed on the cluster, allowing indexing during ingestion.
- Migrated from Oozie to Apache Airflow; developed Oozie and Airflow workflows for daily incremental loads, pulling data from source databases (MongoDB, MS SQL).
- Managed resources and scheduling across the cluster using Azure Kubernetes Service (AKS); used AKS to create, configure, and manage clusters of virtual machines.
- Configured pipelines to receive real-time data from Apache Kafka and store the stream data to HDFS using Kafka Connect.
- Extensively used Kubernetes to handle the online and batch workloads that feed analytics and machine learning applications.
- Used Azure DevOps and VSTS (Visual Studio Team Services) for CI/CD, Active Directory for authentication, and Apache Ranger for authorization.
- Experience tuning Spark applications (batch interval time, level of parallelism, and memory tuning) to improve processing time and efficiency.
- Used Scala for its concurrency support, which plays a key role in parallelizing the processing of large datasets.
- Developed REST APIs using Python with the Flask and Django frameworks and integrated various data sources including Java, JDBC, RDBMS, shell scripting, spreadsheets, and text files.
- Developed MapReduce jobs in Scala, compiling the program code into JVM bytecode for data processing.
- Used Jira for bug tracking and Bitbucket to check in and check out code changes.
- Proficient in preparing data for interactive Power BI dashboards and reporting based on business requirements.
- Extensively worked on Jenkins to implement continuous integration (CI) and continuous deployment (CD) processes.
- Worked in Agile methodology and used JIRA to maintain the project stories.
Environment: Azure HDInsight, Databricks, Data Lake, Cosmos DB, MySQL, Snowflake, MongoDB, Teradata, Flume, Power BI, Blob Storage, Data Factory, Data Storage Explorer, Scala, Hadoop (HDFS, MapReduce, YARN), Spark, Airflow, Hive, Sqoop, HBase, Kubernetes, Jira.
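A minimal sketch of a Python-defined Airflow DAG of the sort mentioned in this role, written in Airflow 2.x style; the DAG id, schedule, and task callables are hypothetical placeholders rather than the original workflow definitions.

```python
# Hypothetical daily incremental-load DAG with two dependent Python tasks.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    # Placeholder: pull the day's increment from the source system
    print("extracting incremental data")


def load():
    # Placeholder: write the transformed increment to the warehouse
    print("loading to the target warehouse")


with DAG(
    dag_id="daily_incremental_load",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task
```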
Nationwide, Columbus, OH November 2018 to June 2021
Data Engineer
Responsibilities:
- Created Entity Relationship Diagrams (ERD), functional diagrams, and data flow diagrams, enforced referential integrity constraints, and created logical and physical models using Erwin.
- Developed Python scripts to automate the data sampling process and ensured data integrity by checking for completeness, duplication, accuracy, and consistency.
- Defined and deployed monitoring, metrics, and logging systems on AWS (see the sketch at the end of this section).
- Analyzed existing application programs and tuned SQL queries using execution plans, Query Analyzer, SQL Profiler, and Database Engine Tuning Advisor to enhance performance.
- Developed SSRS reports and SSIS packages to extract, transform, and load data from various source systems.
- Implemented and managed ETL solutions and automated operational processes.
- Optimized and tuned the Redshift environment, enabling queries to perform up to 100x faster for Tableau and SAS Visual Analytics.
- Published interactive data visualization dashboards, reports, and workbooks on Tableau and SAS Visual Analytics.
- Worked on big data AWS cloud services, i.e., EC2, S3, EMR, and DynamoDB.
- Managed security groups on AWS, focusing on high availability, fault tolerance, and auto-scaling using Terraform templates, along with continuous integration and continuous deployment with AWS Lambda and AWS CodePipeline.
- Developed code to handle exceptions and push failed records into an exception Kafka topic.
- Responsible for ETL and data validation using SQL Server Integration Services.
- Used the Oozie scheduler to automate the pipeline workflow and orchestrate the MapReduce jobs that extract the data on time.
- Created ad hoc queries and reports to support business decisions using SQL Server Reporting Services (SSRS).
- Used Hive SQL, Presto SQL, and Spark SQL for ETL jobs, choosing the right technology to get the job done.
Environment: SQL Server, Kafka, Python, MapReduce, Oracle 11g, AWS, Redshift, Informatica, RDS, NoSQL, Terraform, PostgreSQL.
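A minimal boto3 sketch illustrating the kind of custom AWS metric publishing implied by the monitoring work above; the namespace, metric name, dimensions, and value are hypothetical examples.

```python
# Hypothetical custom CloudWatch metric emitted at the end of a pipeline run.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_data(
    Namespace="DataPipelines/ETL",          # hypothetical namespace
    MetricData=[
        {
            "MetricName": "RowsLoaded",
            "Dimensions": [{"Name": "Pipeline", "Value": "redshift_daily_load"}],
            "Value": 125000,                # hypothetical row count
            "Unit": "Count",
        }
    ],
)
```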
Grapesoft Solutions, Hyderabad, India Nov 2016 to July 2018
Data Engineer / Hadoop Engineer
Responsibilities:
- Wrote MapReduce code to process all the log files against rules defined in HDFS (log files generated by different devices follow different XML rules).
- Developed and designed an application to process data using Spark.
- Implemented partitioning, dynamic partitions, and buckets in Hive for efficient data access.
- Developed simple to complex MapReduce jobs using Hive and Pig.
- Involved in running Hadoop jobs to process millions of records of text data.
- Developed job scheduler scripts for data migration using UNIX shell scripting.
- Handled installation and configuration management of a small multi-node Hadoop cluster.
- Implemented Copy activities and custom Azure Data Factory pipeline activities for on-cloud ETL processing.
- Created notebooks using Databricks, Scala, and Spark and captured data from Delta tables in Delta Lake.
- Installed and configured other open-source software such as Pig, Hive, Flume, and Sqoop.
- Created Databricks notebooks using Python (PySpark), Scala, and Spark SQL to transform data stored in Azure Data Lake Storage Gen2 from Raw to Stage and Curated zones.
- Developed programs in Python and Scala/Spark to reshape data extracted from HDFS for analysis.
- Wrote Hive jobs to parse the logs and structure them in tabular format to facilitate effective querying of the log data.
- Developed job workflows in Oozie to automate the tasks of loading data into HDFS.
- Wrote script files for processing data and loading it to HDFS; imported and exported data into Impala, HDFS, and Hive using Sqoop.
- Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data from sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, including write-back scenarios.
- Captured images of virtual machines, attached disks to virtual machines, and managed and created virtual networks and endpoints in the Azure Portal.
- Wrote Databricks code and fully parameterized ADF pipelines for efficient code management.
- Developed the application using the Struts framework.
- Performed several ad-hoc data analyses in the Azure Databricks analysis platform, tracked on a Kanban board.
- Developed Pig Latin scripts to extract data from web server output files and load it into HDFS.
- Implemented multiple MapReduce jobs in Java for data cleansing and pre-processing.
- Moved RDBMS data from various channels into flat files and then into HDFS for further processing.
Environment: Hadoop, MapReduce, HDFS, Pig, Hive, Azure SQL, Azure Data Lake, Databricks, Data Storage, HDInsight, Power BI, PL/SQL.

Maisa Solutions Private Limited, Hyderabad, India June 2014 to October 2016
Data Analyst
Responsibilities:
- Utilized technologies such as MySQL and Excel PowerPivot to query test data and customize end-user requests.
- Optimized queries with modifications to SQL code, removed unnecessary columns, and eliminated data discrepancies.
- Collected data using Spark Streaming from an AWS S3 bucket in near real time and performed the necessary transformations and aggregations to build the data model and persist the data in HDFS (see the sketch at the end of this section).
- Utilized dimensional data modeling techniques and storyboarded ETL processes.
- Hands-on experience analyzing data and writing custom MySQL queries for better performance, joining tables, and selecting the required data to run reports.
- Built a simulation application using basic HTML, CSS, JavaScript, and Bootstrap for the templates.
- Performed data analysis and data profiling and worked on data transformations and data quality rules.
- Worked with business analysts to design weekly reports using Crystal Reports.
- Developed ETL procedures to ensure conformity, compliance with standards, and lack of redundancy, translating business rules and functionality requirements into ETL procedures using Informatica PowerCenter.
- Re-engineered existing Informatica ETL processes to improve performance and maintainability.
- Automated workflows that were previously initiated manually, using Python scripts and Unix shell scripting.
Environment: Python, Informatica, Java, MySQL, UNIX, ETL, HDFS, AWS, SSAS, Spark, MS Office, MS Excel.
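A minimal PySpark sketch of the S3-to-HDFS streaming ingestion described above, written with the newer Structured Streaming API rather than the original DStream-based job; the bucket, schema, and output paths are hypothetical placeholders.

```python
# Hypothetical sketch: treat new JSON files landing under an S3 prefix as a
# stream, aggregate hourly, and persist the results to HDFS.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import (DoubleType, StringType, StructField, StructType,
                               TimestampType)

spark = SparkSession.builder.appName("s3-stream-ingest").getOrCreate()

schema = StructType([
    StructField("order_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_time", TimestampType()),
])

# Newly arriving files under the prefix are picked up as micro-batches
events = (spark.readStream
          .schema(schema)
          .json("s3a://example-landing-bucket/orders/"))

# Hourly aggregation with a watermark to bound late data
hourly = (events
          .withWatermark("event_time", "2 hours")
          .groupBy(F.window("event_time", "1 hour"))
          .agg(F.sum("amount").alias("revenue")))

query = (hourly.writeStream
         .outputMode("append")
         .format("parquet")
         .option("path", "hdfs:///data/curated/hourly_revenue")
         .option("checkpointLocation", "hdfs:///checkpoints/hourly_revenue")
         .start())
query.awaitTermination()
```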
EDUCATION
Jawaharlal Nehru Technological University, Hyderabad, IN May 2014
Bachelor of Technology in Electronics and Communication Engineering