Candidate's Name
LINKEDIN LINK AVAILABLE | EMAIL AVAILABLE | PHONE NUMBER AVAILABLE

PROFESSIONAL SUMMARY
- Sr. Data Engineer with over 7 years of experience in designing, developing, and building data architectures, database systems, and ETL data pipelines. Specialized in tools such as Spark, Hadoop, Airflow, Databricks, and Snowflake, and in AWS and Azure cloud services, to drive data transformation and enrichment.
- Extensive experience developing analytical solutions and applications with PySpark to surface insights from data patterns.
- Proficient in real-time data processing and streaming analytics using Spark Streaming and Kafka, enabling rapid analysis and insight extraction from live and streaming data sources.
- Expertise in optimizing Spark job performance using the Spark UI and techniques such as partitioning, salting, coalesce, repartition, and bucketing for efficient data transformations (see sketch below).
- Expertise in Spark architecture, Spark Core, Spark SQL, DataFrames, and Spark Streaming, with a proven track record of handling complex Hive queries and data manipulations.
- Proficient in Python, XML, and JSON processing and data exchange, utilizing Python OpenStack APIs for numerical analysis and interfacing with third-party APIs through REST and SOAP.
- Extensive experience managing large datasets using Pandas DataFrames and relational databases (RDBMS) such as MySQL, Oracle, and PostgreSQL.
- Proficient in Snowflake query optimization, including query profiling, query plan analysis, and query rewriting, to reduce execution time and optimize resource utilization.
- Implemented REST APIs in Python using Flask and SQLAlchemy for data center resource management on OpenStack, streamlining data operations.
- Proficient in Airflow, AWS, Azure, and other cloud technologies, facilitating seamless data migration and integration across cloud platforms.
- Experience with Apache Airflow to author workflows as directed acyclic graphs (DAGs), visualize batch and real-time data pipelines running in production, monitor progress, and troubleshoot failed DAGs.
- Expertise in data wrangling, cleansing, and transformation using Python libraries such as Pandas and NumPy, ensuring data quality and accuracy.
- Experience converting Hive/SQL queries into Spark transformations using Spark RDDs and DataFrames.
- Experienced with version control systems such as Git, GitHub, CVS, and SVN to keep code versions and configurations organized.
- Expertise in containerization technologies such as Docker, creating and optimizing Docker images for efficient data processing and ETL tasks within Kubernetes environments.
- Highly skilled in a wide range of AWS services, including S3, DynamoDB, Lambda, Step Functions, Glue, QuickSight, EC2, EBS, ELB, CloudWatch, CloudTrail, SNS, SES, SQS, API Gateway, Redshift, EMR, Kinesis, and AWS CloudFormation templates.
- Proficient in a diverse range of Azure cloud services spanning IaaS and PaaS, with expertise in Azure Synapse Analytics, SQL Azure, Data Factory, Azure Analysis Services, Application Insights, Azure Monitor, Key Vault, Azure Data Lake, management tools, migration, storage, network and content delivery, Active Directory, Azure Container Service, Azure Event Grid, Azure Event Hubs, and VPN Gateway.
- Skilled in designing and developing interactive dashboards and reports using Tableau and Power BI, delivering valuable data-driven insights for business stakeholders.
- Strong problem-solving and analytical skills with a keen eye for detail, coupled with excellent communication and team collaboration abilities.
- Experience with Databricks to architect, implement, and optimize end-to-end data processing workflows, integrating with cloud services such as Azure Blob Storage and AWS S3.
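A minimal, illustrative PySpark sketch of the salting technique referenced above for joins on a skewed key; the table paths, the customer_id column, and the salt factor are assumptions for illustration only, not details from the original resume.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("salted-join-sketch").getOrCreate()

# Hypothetical inputs: a large fact table skewed on customer_id and a smaller dimension table.
fact = spark.read.parquet("s3://example-bucket/fact_events/")    # assumed path
dim = spark.read.parquet("s3://example-bucket/dim_customers/")   # assumed path

SALT_BUCKETS = 16  # assumed salt factor; tune based on skew observed in the Spark UI

# Add a random salt to the skewed side of the join key.
fact_salted = fact.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))

# Replicate the dimension rows across all salt values so every salted key still matches.
dim_salted = dim.crossJoin(
    spark.range(SALT_BUCKETS).withColumnRenamed("id", "salt")
)

# Join on the composite (customer_id, salt) so hot keys spread across more tasks.
joined = fact_salted.join(dim_salted, on=["customer_id", "salt"], how="inner").drop("salt")

Spreading hot keys across SALT_BUCKETS values trades a replicated dimension table for more even task sizes; repartition or coalesce can then be applied downstream as needed.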
TECHNICAL SKILLS
- Big Data Technologies: PySpark, Hadoop, HDFS, Spark, Hive, Sqoop, Oozie, Apache Pig, MapReduce, Kafka, Airflow
- Relational Databases: MySQL, Oracle, Microsoft SQL Server, PostgreSQL
- Non-Relational Databases: MongoDB, Cassandra
- Data Warehouses: Snowflake, Teradata
- Cloud Technologies (AWS/GCP): AWS Databricks, S3, EMR, Glue, EC2, SQS, Redshift, CloudWatch, Kinesis, GCP, BigQuery, Dataproc
- Cloud Technologies (Azure): Azure Databricks, Azure SQL DB, Azure Container Service, Azure Functions, Azure SQL Data Warehouse, Azure Active Directory, Azure HDInsight, Azure Event Hubs
- Programming Languages: Python, HiveQL, PySpark, Scala, SQL, R
- IDEs: Visual Studio Code, Eclipse, PyCharm, Jupyter Notebook, Google Colab, Spark UI
- BI Tools: Power BI, Tableau
- Python Libraries: Matplotlib, Pandas, NumPy, Scikit-learn, SciPy, PyTorch, Keras, TensorFlow
- CI/CD Tools: Git, GitHub, Kubernetes, Bamboo, Jenkins
- ML Algorithms: Regression algorithms (logistic regression), SVMs, Random Forest, K-means clustering, PCA, Neural Networks

PROFESSIONAL EXPERIENCE

Senior Data Engineer
Comcast, Philadelphia - Pennsylvania (February 2022 - Present)
Responsibilities:
- Developed data processing ETL pipelines using PySpark, involving reading data from external sources, merging, enriching, and loading it into Azure SQL Data Warehouse (see the sketch after this section).
- Developed PySpark scripts to process streaming data from data lakes using Spark Streaming, enabling real-time data processing, and performed data transformations, cleaning, and filtering with the Spark DataFrame API to efficiently load processed data into Hive.
- Leveraged Hive metastore backups, partitioning, and bucketing techniques to optimize and tune Spark job performance.
- Strong understanding of Spark architecture, including Spark Core, Spark SQL, DataFrames, Spark Streaming, and related components such as the driver node, worker nodes, stages, executors, and tasks.
- Handled complex Hive queries, conducting table joins to extract meaningful insights related to Spark jobs.
- Converted Hive SQL queries into Spark transformations using Spark DataFrames and Python, analyzing SQL scripts to design efficient PySpark solutions.
- Conducted performance tuning of Spark applications, optimizing batch interval time, parallelism, and memory settings for enhanced efficiency.
- Implemented hybrid connectivity between Azure and on-premises environments using virtual networks, VPN, and ExpressRoute.
- Configured monitoring and logging solutions using Azure Monitor, Application Insights, and Azure Log Analytics to gain insight into Azure Kubernetes Service cluster performance and application health.
- Migrated SQL databases to Azure Data Lake, Azure SQL Database, Databricks, and Azure SQL Data Warehouse using Apache Airflow, ensuring seamless database access and migration.
- Implemented Azure Data Factory (ADF) extensively for ingesting data from different source systems, both relational and unstructured, to meet business functional requirements.
- Configured Snowpipe for continuous data flow from the Azure data lake through an external stage into the Snowflake staging layer.
- Integrated Azure Kubernetes Service (AKS) with Azure services such as Azure Blob Storage, Azure SQL Database, and Azure Data Lake Storage for data storage, processing, and analytics.
- Implemented end-to-end extract, transform, and load (ETL) processes using Azure Data Factory (ADF) and Azure HDInsight.
- Managed Azure Data Lake Storage (ADLS) and Data Lake Analytics, integrating them with other Azure services such as Azure Synapse Analytics (formerly Azure SQL Data Warehouse), Azure Blob Storage, and Azure Databricks for data transformation and orchestration.
- Implemented CI/CD pipelines using Azure DevOps, Jenkins, and other tools to automate deployment of data processing applications onto Azure Kubernetes Service clusters.
- Responsible for estimating cluster size, monitoring, and troubleshooting the Spark cluster, ensuring smooth operations in Azure Databricks.
- Developed interactive reports and dashboards using Power BI, offering customized parameters and presenting insights through tables, graphs, and listings.
- Established end-to-end continuous integration and continuous deployment (CI/CD) pipelines on Azure DevOps, automating software delivery processes.
Environment: PySpark, Hive, Spark, Spark SQL, Spark DataFrame API, Azure Data Lake, Azure SQL Database, Azure Data Factory, Azure HDInsight, Apache Airflow, Python, Azure Synapse Analytics, Azure Blob Storage, Azure Databricks, Azure DevOps, Power BI.
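A minimal, illustrative PySpark sketch of the read-merge-enrich-load pattern described in the first bullet above; the storage paths, column names, and target table are assumptions, not details from the original resume.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl-enrichment-sketch").enableHiveSupport().getOrCreate()

# Hypothetical raw feed and reference data in the data lake (paths and schemas are assumed).
events = spark.read.json("abfss://raw@examplelake.dfs.core.windows.net/events/")
accounts = spark.read.parquet("abfss://curated@examplelake.dfs.core.windows.net/accounts/")

# Clean and filter the raw feed, then enrich it with account attributes.
cleaned = (
    events
    .dropDuplicates(["event_id"])
    .filter(F.col("event_ts").isNotNull())
    .withColumn("event_date", F.to_date("event_ts"))
)
enriched = cleaned.join(accounts, on="account_id", how="left")

# Load the result into a partitioned table for downstream consumption.
(
    enriched.write
    .mode("overwrite")
    .partitionBy("event_date")
    .saveAsTable("analytics.enriched_events")
)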
Data Engineer
Pluto TV, Los Angeles - California (October 2019 - December 2021)
Responsibilities:
- Developed PySpark applications for encrypting raw data using hashing algorithms on client-specified columns, ensuring data security and privacy.
- Utilized the Spark SQL API in PySpark for data extraction, loading, and SQL queries, transforming existing HiveQL queries into Python-based transformations with optimized techniques.
- Responsible for creating on-demand tables on S3 files using Lambda functions and AWS Glue with Python and Spark, streamlining data storage and access.
- Leveraged Spark Streaming APIs to perform the necessary transformations and actions on Kafka-sourced data, persisting the results into the AWS data lake (see the sketch after this section).
- Developed Spark applications capable of handling data from RDBMS sources (MySQL, Oracle Database) and streaming sources, facilitating efficient data processing.
- Led the development of a data pipeline with AWS to extract data from weblogs and store it in Amazon EMR, ensuring seamless data integration.
- Implemented data warehousing solutions on Amazon Redshift, optimizing query performance and enabling data-driven decision-making through rapid data retrieval and analysis.
- Developed custom Spark scripts in Python for data transformations, employing RDDs to efficiently manipulate data and perform actions, and worked with diverse file formats such as Parquet, ORC, and Avro for data import.
- Enhanced cloud infrastructure by integrating Google Cloud's VPC with AWS's VPC, facilitating secure, high-performance data exchange across hybrid cloud resources.
- Developed and deployed data processing pipelines on Amazon EMR (Elastic MapReduce) to efficiently process and analyze datasets, leveraging distributed computing and AWS infrastructure.
- Implemented real-time processing and core jobs using Spark Streaming with Kafka as the data pipeline system, enabling real-time data analysis and insights.
- Developed PySpark data processing tasks, including reading data from external sources, merging, enriching, and loading it into target destinations.
- Resolved performance issues in PySpark by optimizing groupings, joins, and aggregation functions, and scheduled PySpark jobs on a cluster using AWS Databricks.
- Developed interactive dashboards and reports using Tableau visualizations, incorporating bar graphs, line diagrams, funnel charts, scatter plots, pie charts, bubble charts, heat maps, and tree maps to meet business requirements.
Environment: PySpark, Spark SQL API, Lambda functions, Python, Spark SQL, HDFS, Spark Streaming, Kafka, Hive, DataFrames, Sqoop, Amazon EMR, AWS Redshift, AWS Glue, AWS Databricks, Oracle, S3 buckets, Tableau.
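A minimal, illustrative Spark Structured Streaming sketch of the Kafka-to-data-lake pattern referenced above; the broker address, topic name, payload schema, and output paths are assumptions for illustration only.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

# Assumed schema for the JSON payload carried on the topic.
schema = StructType([
    StructField("event_id", StringType()),
    StructField("user_id", StringType()),
    StructField("event_ts", TimestampType()),
])

# Read the stream from Kafka (bootstrap servers and topic are placeholders).
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker-1:9092")
    .option("subscribe", "viewer-events")
    .load()
)

# Parse the Kafka value, apply light transformations, and persist to the data lake.
parsed = (
    raw.select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
    .select("e.*")
    .filter(F.col("event_id").isNotNull())
)

query = (
    parsed.writeStream
    .format("parquet")
    .option("path", "s3://example-data-lake/viewer_events/")
    .option("checkpointLocation", "s3://example-data-lake/_checkpoints/viewer_events/")
    .outputMode("append")
    .start()
)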
Data Engineer
Johnson and Johnson, Raritan - New Jersey (March 2018 - July 2019)
Responsibilities:
- Analyzed and processed large, critical datasets using Spark, Hive, Hive UDFs, MapReduce, Sqoop, Pig, and HDFS, ensuring efficient data handling.
- Developed Spark jobs for data validation, standardization, and transformation from the data center, enhancing data quality and usability.
- Utilized Spark SQL to load JSON data, create schema RDDs, and load them into Hive tables, effectively managing structured data.
- Conducted complex Hive queries, extracting meaningful insights related to Spark jobs by performing table joins and data manipulations.
- Developed Python scripts and UDFs using DataFrames and MapReduce in Spark, enabling data aggregation, querying, and writing data back into HDFS.
- Performed data cleansing, enrichment, mapping, and automated data validation to ensure the accuracy and reliability of reported data.
- Designed and created Hive external tables with a shared metastore, implementing static partitioning, dynamic partitioning, and bucketing strategies.
- Utilized Informatica to design and implement ETL (extract, transform, load) workflows, enabling seamless data integration and migration across diverse systems.
- Conducted data wrangling using libraries such as Pandas, transforming and reshaping data for analysis and reporting.
- Analyzed data using SQL, Python, and Apache Spark, and presented analytical reports to both management and technical teams, facilitating data-driven decision-making.
- Imported data from AWS S3 into Spark RDDs, applying the necessary transformations and actions to extract valuable insights.
- Created a data pipeline for ingestion, aggregation, and loading of consumer response data from an AWS S3 bucket into Hive external tables in HDFS, serving as a feed for Tableau dashboards.
- Utilized Airflow operators and other Python libraries for data orchestration and ingestion, streamlining the data processing workflow.
- Scheduled Airflow DAGs to execute multiple Hive and Spark jobs, ensuring smooth and timely data processing based on time and data availability (see the sketch after this section).
- Played a pivotal role in various transformation and data cleansing activities, implementing control flow and data flow tasks in SSIS packages during data migration.
- Developed SQL queries, stored procedures, common table expressions (CTEs), and temporary tables to support SSRS and Power BI reports, enabling efficient data retrieval and visualization.
Environment: PySpark, Hive, Sqoop, Hive UDFs, MySQL, Linux, Python, Airflow, Airflow DAGs, JSON, Apache Spark, Informatica, Power BI, RDBMS.
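A minimal, illustrative Airflow DAG of the kind referenced above, chaining a Hive step and a Spark job on a daily schedule; the DAG name, script paths, and schedule are assumptions, not details from the original resume.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Daily pipeline: refresh a Hive staging table, then run the Spark aggregation job.
with DAG(
    dag_id="daily_consumer_response_pipeline",  # assumed DAG name
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    load_hive_staging = BashOperator(
        task_id="load_hive_staging",
        bash_command="hive -f /opt/etl/load_staging.hql",             # assumed script path
    )

    run_spark_aggregation = BashOperator(
        task_id="run_spark_aggregation",
        bash_command="spark-submit /opt/etl/aggregate_responses.py",  # assumed job path
    )

    load_hive_staging >> run_spark_aggregation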
Python Developer
DMART, Mumbai - India (January 2017 - March 2018)
Responsibilities:
- Involved in the entire project lifecycle, including design, development, deployment, testing, implementation, and support.
- Developed web-based applications using PHP, XML, JSON, and MVC3, utilizing Python scripts for database updates and file manipulation, enhancing application functionality.
- Built database models, APIs, and views with Python to create interactive solutions, contributing to seamless user experiences.
- Designed and developed the presentation layer of web applications using HTML, CSS, JavaScript, jQuery, AJAX, and Bootstrap, ensuring visually appealing and responsive interfaces.
- Developed XML documents and frameworks for efficient parsing of XML documents, streamlining data processing.
- Utilized Python for XML and JSON processing, facilitating data exchange and seamless business logic implementation.
- Worked on Python OpenStack APIs and performed numerical analysis using NumPy, integrating with third-party APIs through REST and SOAP interfaces.
- Managed large datasets using Pandas DataFrames and relational databases (RDBMS) such as MySQL, Oracle, and PostgreSQL, ensuring efficient data handling.
- Played a key role in implementing REST APIs in Python using the Flask micro-framework with a SQLAlchemy backend for data center resource management on OpenStack (see the sketch after this section).
- Collaborated on application development, using the Jenkins continuous integration tool for project deployment and managing version control with Git.
- Improved classification accuracy by 10% by fine-tuning logistic regression models, resulting in more precise identification of target outcomes in a customer churn prediction project.
- Increased precision and recall by 15% in a fraud detection system by implementing and optimizing SVM models, leading to more effective identification of suspicious transactions.
- Enhanced prediction accuracy in a recommendation engine by employing Random Forest models to analyze user behavior and provide more personalized content suggestions.
- Reduced customer segmentation errors through K-means clustering, resulting in more targeted marketing strategies and improved customer satisfaction.
- Actively participated in Agile methodologies and the Scrum process, promoting efficient project management and successful project delivery.
- Achieved a 15% increase in the F1 score of a sentiment analysis model by fine-tuning neural networks using TensorFlow and Keras, leading to more accurate sentiment predictions on social media data.
- Accelerated data processing speed by 25% in a large-scale image recognition project by applying PCA for dimensionality reduction, maintaining model accuracy while significantly reducing computational resources.
- Performed data visualization on survey data using Tableau Desktop and univariate analysis using Python (Pandas, NumPy, Seaborn, scikit-learn, and Matplotlib).
Environment: Python, XML, MySQL, Apache, HTML, CSS, JavaScript, shell scripts, Oracle, PostgreSQL, REST, SOAP, JSON, Django.
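A minimal, illustrative Flask + SQLAlchemy sketch of the REST API pattern referenced above; the resource model, database URI, and endpoint paths are assumptions for illustration only.

from flask import Flask, jsonify, request
from flask_sqlalchemy import SQLAlchemy

app = Flask(__name__)
app.config["SQLALCHEMY_DATABASE_URI"] = "sqlite:///resources.db"  # assumed; a PostgreSQL URI would be typical
db = SQLAlchemy(app)

class ComputeResource(db.Model):
    """Hypothetical record for a data-center resource tracked via the API."""
    id = db.Column(db.Integer, primary_key=True)
    hostname = db.Column(db.String(128), nullable=False)
    status = db.Column(db.String(32), default="available")

@app.route("/resources", methods=["GET"])
def list_resources():
    # Return all tracked resources as JSON.
    resources = ComputeResource.query.all()
    return jsonify([{"id": r.id, "hostname": r.hostname, "status": r.status} for r in resources])

@app.route("/resources", methods=["POST"])
def create_resource():
    # Register a new resource from the JSON request body.
    payload = request.get_json()
    resource = ComputeResource(hostname=payload["hostname"], status=payload.get("status", "available"))
    db.session.add(resource)
    db.session.commit()
    return jsonify({"id": resource.id}), 201

if __name__ == "__main__":
    with app.app_context():
        db.create_all()
    app.run()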
EDUCATION
Wilmington University - Master's in Information Systems and Technology, 2023