Candidate's Name
Sr. Data Engineer/Scientist | ETL Developer | Big Data Expert
Phone: PHONE NUMBER AVAILABLE; Email: EMAIL AVAILABLE

PROFESSIONAL SUMMARY
Strategic data engineering professional with 12+ years of experience working on Big Data, ETL, and Hadoop projects.
Led migrations, installations, and development efforts; restructured existing strategic summary dashboards and added new functionality within the GCP environment.
Adept at validating reports and dashboards by writing SQL queries within GCP.
Demonstrated expertise working in AWS with Data Lakes and Big Data ecosystems.
Used BI tools such as Tableau and Power BI for data interpretation, modeling, analysis, and reporting within AWS to deliver decision-making insights to stakeholders.
Extensive knowledge of Hadoop architecture and components such as HDFS, Job Tracker, Task Tracker, Name Node, Data Node, and MapReduce concepts within AWS.
Used Apache Hadoop to analyze large datasets and derive results from Big Data while working with MapReduce programs.
Designed and implemented logical and physical data models to optimize data storage, retrieval, and analysis.
Implemented Agile methodology using data-driven analytics within AWS.
Analyzed the MS SQL data model within GCP and on-premises environments and provided inputs for converting existing dashboards that used Excel as a data source.
Led the team in designing scalable data architectures using services such as Azure Data Lake Storage, Azure Synapse Analytics, and Azure SQL Database.
Led performance optimization initiatives for Apache Spark, achieving significant efficiency gains across Databricks, Glue, EMR, and on-premises systems, enhancing data processing capabilities and reducing processing times.
Played a key role in leading on-premises data migrations to the cloud, using tools like Apache NiFi and Attunity to ensure a smooth and efficient transition, resulting in enhanced scalability and accessibility of data resources.
Mapped data flows between various systems and applications to improve data movement and minimize errors.
Experienced in implementing CI/CD pipelines for efficient software delivery.
Expertise in developing and maintaining scalable data pipelines; built new API integrations to support continuing increases in data volume and complexity.

TECHNICAL SKILLS
Big Data: Amazon AWS - EC2, S3, Kinesis, Azure, Cloudera Hadoop, Hortonworks Hadoop, Spark, Spark Streaming, Hive, Kafka, Snowflake, Databricks, NLP
Programming Languages: Scala, Python, Java, Bash
Hadoop Components: Hive, Pig, Zookeeper, Sqoop, Oozie, Yarn, Maven, Flume, HDFS, Airflow
Hadoop Administration: Zookeeper, Oozie, Cloudera Manager, Ambari, Yarn
Data Management: Apache Cassandra, AWS Redshift, Amazon RDS, Apache HBase, SQL, NoSQL, Elasticsearch, HDFS, Data Lake, Data Warehouse, Database, Teradata, SQL Server
ETL Data Pipeline Architecture: Apache Airflow, Hive, Sqoop, Flume, Scala, Python, Apache Kafka, Logstash
Scripting: HiveQL, SQL, Shell Script Language
Big Data Frameworks: Spark and Kafka
Spark Framework: Spark API, Spark Streaming, Spark SQL, Spark Structured Streaming
Data Visualization: Tableau, QlikView, Power BI
Software Development IDEs: Jupyter Notebooks, PyCharm, IntelliJ
Continuous Integration (CI/CD): Jenkins
Versioning: Git, GitHub, Bitbucket
Methodologies: Agile Scrum, Test-Driven Development, Continuous Integration, Unit Testing, Functional Testing, Scenario Testing

WORK EXPERIENCE
October 2022-Present: Berkshire Hathaway, Omaha, Nebraska as Principal Data Engineer
Project Summary: In the Portfolio Optimization project, I led the team in using big data analytics to optimize investment portfolios by analyzing asset performance, market volatility, and risk factors, resulting in improved portfolio diversification and enhanced returns.
Deliverables:
Developed an ETL pipeline integrating Snowflake in an on-premises setting and designed a robust ingestion architecture to support it.
Led the development and implementation of a master data management strategy to ensure data consistency and accuracy.
Managed distributed data processing to handle large volumes of financial data efficiently using Apache Spark, with Spark SQL for querying structured data.
Extracted valuable insights from big data with data mining techniques using Spark's Python API (PySpark) and ETL pipelines.
Utilized Databricks Delta Live Tables to build reliable, end-to-end data pipelines.
Provided technical leadership and mentorship to junior team members on data engineering best practices.
Designed and implemented scalable data processing pipelines using Google Cloud Dataflow, Google Cloud Dataproc, and Google Cloud Dataprep.
Conducted market segmentation using cluster analysis (Hierarchical Clustering and K-Means) and dimensionality reduction techniques, leveraging demographic and geographic data to effectively segment the market.
Developed scalable data processing pipelines using Python-based frameworks such as Apache Spark (PySpark), Apache Beam, and Dask.
Integrated Google Cloud Dataflow, Google Cloud Dataproc, and Google Cloud Dataprep with other GCP services like Google Cloud Bigtable and Google Cloud Datastore.
Developed customer personas based on various attributes, enabling targeted marketing strategies and personalized customer experiences.
Built customer churn predictive models using Logistic Regression, Random Forest, and XGBoost to identify potential customer churn.
Developed and enforced data quality standards and processes to maintain data integrity and enable smooth data exchange.
Integrated Google Cloud Storage, Google BigQuery, and Google Cloud Datastore to create a robust data lake and data warehouse architecture.
Analyzed historical customer data and developed a robust model that accurately predicted churn, enabling proactive retention strategies.
Collaborated with cross-functional teams to perform customer analytics and develop strategies to mitigate risks and enhance returns within portfolios.
Implemented complex data transformations and ETL (Extract, Transform, Load) workflows using PySpark.
Leveraged the Databricks Lakehouse architecture to enable self-service analytics and data exploration.
Implemented data quality monitoring and alerting mechanisms using Google Cloud Monitoring and Google Cloud Logging.
Integrated Databricks with on-premises and hybrid data environments using Databricks Repos and Databricks Workflows.
Utilized GPT-3.5 Turbo, embedding models, the Pinecone vector database, and prompt engineering to generate market and customer sentiment based on micro and macro trends.
Developed time series models using ARIMA and LSTM to forecast stock prices in a portfolio, which depend on numerous factors such as supply and demand, company performance, and investor sentiment.
Developed and optimized Google Cloud Dataflow pipelines to ingest, transform, and load data from various sources into Google Cloud Storage and Google BigQuery.
Championed and implemented data governance frameworks and policies to ensure data security, privacy, and accessibility.
Participated in the evaluation and selection of data management tools and technologies.
Worked on Snowflake with Apache Spark for distributed data processing, leveraging Snowflake's storage and Spark's processing capabilities to handle large financial datasets effectively.
Captured real-time data from various events and used Apache Kafka for real-time data ingestion and stream processing, ensuring accurate and timely availability of market data for analysis.
Ensured an accurate and regular flow of data by using Apache NiFi for data routing, resulting in smooth ETL operations across various data sources and destinations.
Collaborated with cross-functional teams to understand business needs and translate them into technical data solutions.
Implemented data quality monitoring and alerting mechanisms using Databricks Metrics and Databricks Alerts.
Strategized data management on Google Cloud Platform (GCP) for scalable and cost-effective infrastructure to host big data analytics solutions, using GCP's Compute Engine and storage services such as BigQuery.
Leveraged Python's NumPy and SciPy libraries for efficient numerical and scientific computing in big data applications.
Used Google BigQuery as the primary data warehousing solution, ingesting data through various methods including transactional databases and external feeds, and wrote SQL queries for analytics, enabling quick insights into portfolio data.
Integrated Python-based data processing pipelines with various data sources, including relational databases, NoSQL databases, and data streaming platforms.
Utilized SQL and various data manipulation tools to extract, transform, and load data.
Used PySpark and Scala for large-dataset processing, including RDDs and DataFrames, transforming data with various APIs and libraries to implement advanced analytics and algorithms.
Involved in PySpark and Scala for data processing and analytics tasks, taking advantage of Spark's distributed computing capabilities.
Implemented data warehousing and data management concepts to support business intelligence and analytics.
Utilized Google Cloud Dataproc to run Apache Spark and Apache Hadoop workloads for large-scale data processing and analytics.
Incorporated feedback from stakeholders to refine ETL processes and improve the relevance and accuracy of analytics insights.
Developed custom Databricks notebooks and libraries to address specific data processing requirements.
Developed Python scripts and modules to automate the deployment and management of big data infrastructure using tools like Ansible, Terraform, and Docker.
Designed and implemented data backup and disaster recovery strategies to ensure data resilience.
Used TensorFlow and scikit-learn to build neural network models for analytics tasks, producing decision-making predictions such as asset prices, market trends, and risk assessments.
Stored large datasets in the Apache Hadoop ecosystem using tools such as HDFS, MapReduce, and Hive for batch processing and data querying, resulting in process optimization.
Utilized Tableau and Google Data Studio to build interactive dashboards and visualizations communicating portfolio performance metrics to stakeholders for organizational decision-making.
Integrated Jupyter Notebooks with Apache Zeppelin for exploratory data analysis and collaborative analytics workflows.
Implemented Python-based data visualization and reporting solutions using libraries like Matplotlib, Seaborn, and Plotly.
Communicated valuable insights effectively by creating customized visualizations and charts with Python's Matplotlib and Seaborn libraries.
Implemented monitoring solutions using Prometheus and Grafana to track data pipeline health, performance metrics, and resource utilization.
Integrated alerting mechanisms using Slack and email notifications to promptly address issues or anomalies detected in data processing pipelines.
Collaborated with data scientists and analysts to refine portfolio optimization algorithms and models based on feedback and performance metrics, continuously improving the accuracy and effectiveness of the system.

March 2020-October 2022: Walgreens Boots Alliance, Inc., Deerfield, IL as Lead Data / Gen AI Architect
Project Summary: In the Supply Chain Optimization project, I contributed to data analytics to optimize various aspects of the supply chain, including procurement, transportation, warehousing, and distribution. By analyzing supply chain data, our team identified bottlenecks, reduced costs, improved delivery times, and enhanced overall operational efficiency. As lead Gen AI engineer, I was tasked with building a Natural Language Processing (NLP) platform to analyze customer communications, both to gauge satisfaction levels and to mine information for better targeted marketing and an enhanced overall shopping experience. The tool optimized product visibility in search results by contributing extracted text and applying NER to the product recommendation system. I led an interdisciplinary team that incorporated Data Engineering, Modeling, and ML-Ops resources. The system was deployed on AWS EKS as a microservice API for integration into the organization's front end.
Deliverables:
Architected a data warehouse using AWS Redshift to organize data and make analytics jobs more efficient.
Ingested data from various sources using AWS services such as Amazon S3 and AWS Glue for smooth data acquisition.
Designed, developed, and implemented cutting-edge Natural Language Processing (NLP) models and algorithms, focusing on tasks such as sentiment analysis, text classification, and entity recognition.
Utilized TensorFlow, PyTorch, NLTK, and spaCy to seamlessly integrate NLP and deep learning techniques into e-commerce applications.
Leveraged LangChain and embedding models such as BERT to efficiently split and embed various document formats, including PDF, HTML, and text files.
Developed an innovative chunk provenance scheme integrated into the metadata of the vector DB, enhancing data traceability and management.
Implemented advanced prompt engineering techniques to generate system and user prompts for insertion into large language models (LLMs).
Assessed LLM responses using metrics such as BLEU score, perplexity, and diversity to ensure high-quality output.
Conducted thorough evaluations of Retrieval-Augmented Generation (RAG) effectiveness by testing relevance and veracity, ensuring the reliability of the generated content.
Utilized Python's data serialization and deserialization capabilities to handle various data formats, including JSON, CSV, Parquet, and Avro.
Worked on Snowflake in conjunction with AWS Glue to develop robust data pipelines for ETL processes, using Snowflake's storage and processing capabilities to extract, transform, and load supply chain data efficiently.
Designed and implemented Python-based data quality validation and testing frameworks to ensure the integrity of big data pipelines.
Successfully implemented parallel processing techniques to expedite ETL tasks, especially for handling large volumes of data, saving time.
Cleaned and sorted data using AWS Glue's serverless architecture and Spark-based processing engine to improve data quality.
Automated the deployment and management of Databricks resources using Infrastructure as Code (IaC) tools like Terraform.
Developed Python-based data monitoring and alerting systems to proactively identify and address issues in big data environments.
Leveraged master data management principles and best practices to improve data quality and consistency.
Analyzed data using AWS services such as Amazon Redshift and Amazon Athena, and used interactive querying to deliver decision-making results.
Utilized AWS Lambda functions to automate data processing tasks and trigger actions based on supply chain insights.
Configured AWS CloudWatch for monitoring and alerting to ensure the smooth operation of supply chain analytics processes.
Implemented Python-based data orchestration and workflow management solutions using tools like Apache Airflow and Luigi.
Applied data quality methodologies and data cleansing techniques to maintain data integrity.
Worked on AWS services such as Amazon Kinesis and AWS Lambda for real-time data processing and analysis, along with Amazon S3 for securely storing and managing large supply chain datasets, helping retailers make timely decisions based on changing supply chain conditions.
Leveraged Python's data streaming and real-time processing capabilities using libraries like Apache Kafka (confluent-kafka-python).
Integrated Databricks with other data and analytics services, such as Tableau, Power BI, and Amazon SageMaker.
Skillfully visualized and reported supply chain metrics and KPIs by integrating AWS services such as Amazon QuickSight.
Integrated AWS IoT services for real-time monitoring of inventory levels and transportation conditions to optimize supply chain operations.
Contributed to the development and maintenance of data catalogs and data discovery platforms.
Used AWS for ETL processes and ML to optimize and analyze various aspects of the supply chain, resulting in improved process efficiency.
Diligently monitored ETL processes and fine-tuned them to maintain optimal performance levels.

January 2017-March 2020: First Republic Bank, New York, NY as Senior ML/Cloud Engineer
Project Summary: Worked with the team to develop a fraud detection system using big data analytics to identify suspicious patterns and transactions in real time. Used machine learning algorithms to detect anomalies in transaction data, identify potentially fraudulent activities, and trigger alerts for further investigation.
Deliverables:
Performed comprehensive data analysis involving the collection, cleaning, and analysis of extensive and intricate datasets; employed statistical methods, data visualization, and exploratory data analysis techniques to discern patterns, trends, and anomalies and extract meaningful insights.
Designed and deployed data governance frameworks and procedures to ensure data security and compliance.
Conducted quantitative research to formulate investment strategies, optimize trading algorithms, and assess market opportunities; developed and refined quantitative models, conducted back-testing of strategies, performed statistical analyses, and explored new data sources to enhance investment decisions and trading performance.
Built Artificial Neural Network models using Airflow for detecting anomalies and fraud in transaction data, ensuring fair representation of minority data through stratified imbalanced-data techniques during cross-validation.
Demonstrated excellent communication and collaboration skills to work effectively with technical and non-technical stakeholders.
Designed and deployed real-time data streaming pipelines using Databricks Structured Streaming.
Developed risk models and frameworks for assessing and managing various financial risks, including credit risk models, market risk models, operational risk models, stress testing frameworks, and scenario analyses; ensured compliance with regulatory requirements and supported risk management practices.
Developed Python-based data lake and data warehouse management utilities, including data partitioning, compaction, and optimization.
Optimized data movement and integration between various systems and applications.
Utilized Git for version control on GitHub to collaborate and work efficiently with team members.
Ingested accurate, real-time data using Azure services such as Azure Event Hubs, Azure Stream Analytics, and Azure Databricks.
Analyzed the ingested data, using Azure Databricks for data pre-processing, transformation, and feature engineering.
Developed and maintained data dictionaries and metadata repositories to enhance data discoverability.
Implemented Python-based data lineage and provenance tracking solutions to enhance data governance and auditability.
Architected and deployed highly available and fault-tolerant big data solutions on Amazon EMR, leveraging Apache Spark and Apache Hadoop.
Utilized HDInsight clusters on Azure for big data processing, analyzing market trends and competitor pricing strategies with tools like Apache Spark and Hadoop.
Provisioned data and deployed models by containerizing applications, storing data securely with Azure services such as Azure Virtual Machines, Azure Kubernetes Service (AKS), and Azure Storage.
Implemented data governance and security controls using Azure RBAC.
Contributed to the development and maintenance of open-source Python libraries and frameworks for big data engineering.
Worked with the team to develop and deploy machine learning models using Airflow.
Worked on Azure services such as Azure Stream Analytics and Azure Functions to analyze streaming transaction data for suspicious patterns and anomalies.
Implemented data lineage and impact analysis processes to track data provenance and dependencies.
Utilized Databricks Repos and Databricks Workflows to orchestrate complex, multi-step data processing pipelines.
Implemented data security and access control mechanisms to ensure data privacy and confidentiality.
Implemented trigger alerts and notifications for potentially fraudulent activities detected in real time, with alerting mechanisms built on Azure Monitor and Azure Logic Apps.
Analyzed streaming transaction data for suspicious patterns and anomalies by implementing real-time analytics solutions using Azure services such as Azure Stream Analytics and Azure Functions.
For data security, implemented encryption, access controls, and data masking techniques to protect sensitive transaction data and comply with regulatory requirements such as GDPR and PCI-DSS.
Involved in knowledge-sharing sessions and workshops to share expertise, best practices, and lessons learned with team members and promote continuous learning and improvement.
August 2015-January 2017: Bristol Myers Squibb, Princeton, New Jersey as Big Data Engineer
Project Summary: Worked with the team on a Personalized Medicine project, utilizing big data analytics to analyze genetic, clinical, and lifestyle data and develop personalized treatment plans for patients. By identifying genetic markers and biomarkers associated with specific diseases or drug responses, we tailored treatments to individual patients, improving efficacy and reducing adverse reactions.
Deliverables:
Ensured secure storage of large amounts of data for analysis using on-premises data storage solutions like HDFS and Apache HBase.
Utilized data virtualization and federation techniques to provide a unified view of data across disparate sources.
Used patient demographics and clinical records, storing raw genomic data and other unstructured data in HDFS and using HBase for real-time access to structured data.
Developed data processing pipelines using Apache Spark and Apache Flink for complex analytics tasks.
Integrated Databricks with cloud storage services like Amazon S3, Google Cloud Storage, and Azure Blob Storage to create a robust data lake architecture.
Utilized AWS CloudTrail to maintain an audit trail of ETL processes and data transformations, ensuring traceability and accountability for data operations performed within the AWS environment.
Designed and implemented data lake architectures to support big data analytics and machine learning.
Provided technical leadership and mentorship to junior team members on the effective use of Python for big data engineering tasks.
Utilized Spark for batch processing of genomic sequencing data and clinical records, performing variant calling, association studies, and genotype-phenotype correlation analysis.
Employed Flink for real-time data processing and stream analytics, enabling continuous analysis of streaming data from medical devices and sensors.
Leveraged data partitioning and indexing strategies to improve query performance and scalability.
Worked with Spark's RDD (Resilient Distributed Dataset) and DataFrame APIs for distributed data processing, using Spark's in-memory caching and parallel execution capabilities.
Integrated machine learning libraries such as Apache Mahout and TensorFlow into data processing pipelines for model training and evaluation.
Participated in the development of data migration and ETL (Extract, Transform, Load) pipelines.
Implemented data masking and anonymization techniques to protect sensitive information.
Used Mahout for scalable machine learning algorithms and TensorFlow for deep learning models, leveraging on-premises computational resources for model training and inference.
Worked on monitoring and management solutions using Apache Ambari and Cloudera Manager to monitor the health, performance, and scalability of the Hadoop cluster.
Used Grafana and Prometheus to monitor metrics such as resource utilization, data throughput, and job execution time.
Successfully deployed machine learning models for predictive analytics, identifying genetic markers and biomarkers associated with diseases or drug responses.

January 2012-August 2015: Subaru Corporation, Camden, New Jersey as Data Engineer/Data Analyst
Project Summary: Contributed to the ongoing Product Development and Testing project, using big data analytics to simulate and test vehicle performance, safety features, and fuel efficiency during the product development process, enabling the company to identify potential issues early and optimize designs before production.
Deliverables:
Configured and managed on-premises data centers to support the development and testing environment for vehicle simulation and analytics, using technologies such as VMware vSphere, Microsoft Hyper-V, or Citrix XenServer for virtualization.
Developed and maintained data monitoring and alerting systems to proactively identify and address data quality issues.
Implemented on-premises storage solutions to store large volumes of vehicle performance data, safety test results, and simulation outputs, working with technologies like EMC Isilon, NetApp FAS series, or Dell EMC Unity for scalable and reliable storage.
Set up and maintained on-premises cluster computing environments to perform parallel processing of vehicle simulation data and analytics tasks.
Designed and implemented data archiving and retention policies to optimize storage and compliance.
Deployed technologies such as Apache Hadoop, Apache Spark, or Apache Flink for distributed data processing.
Collaborated with software development teams to integrate vehicle simulation and testing tools with on-premises infrastructure, ensuring compatibility and seamless operation of software applications using version control systems like Git or Subversion.
Utilized data visualization and dashboarding tools to enable data-driven decision-making.
Implemented security measures and compliance standards to protect sensitive vehicle data and intellectual property; configured firewalls, access controls, and encryption mechanisms to safeguard on-premises infrastructure from unauthorized access or data breaches.
Optimized the performance of on-premises infrastructure and applications to meet the high computational demands of vehicle simulation and testing; fine-tuned hardware configurations, network settings, and software parameters to achieve optimal throughput and responsiveness.

EDUCATION
Master of Science in Systems Science (Data Visualization) from Parsons, The New School
BA in English Literature from the University of Iowa