Senior Data Engineer
Sai Mani Teja Reddy
Phone: PHONE NUMBER AVAILABLE
Email: EMAIL AVAILABLE
PROFESSIONAL SUMMARY:
Dynamic and motivated IT professional with around 9 years of experience as a Data Engineer, with expertise in
designing data-intensive applications using the Hadoop ecosystem, Big Data analytics, cloud data engineering,
data visualization, reporting, and web application development using Python, across the Pharma,
Telecommunications, Finance, Insurance, and Business domains.
Developed Spark applications using Spark-SQL in Databricks, extracting, transforming, and aggregating data
from diverse formats to uncover crucial insights into customer usage patterns.
Hands-on experience with Amazon EC2, Amazon S3, Amazon RDS, VPC, IAM, Elastic Load Balancing,
Auto Scaling, CloudWatch, SNS, SES, SQS, Lambda, EMR, and other services of the AWS family. Crafted automated
ingestion scripts using PySpark and Scala, seamlessly connecting with APIs, AWS S3, Teradata, and Redshift.
Extensive hands-on experience in developing PySpark applications for distributed data processing.
Implemented seamless integration between Python and Spark, leveraging PySpark's Python API.
Exhibited a deep understanding of Azure Big Data technologies, proficiently navigating Azure Data Lake
Analytics, Azure Data Lake Store, Azure Data Factory, and Azure Synapse, and crafted Proofs of Concept for
efficient data migration. Managed tracking tools such as JIRA and ServiceNow, demonstrating a strong ability to
build CI/CD pipelines with GitLab, Jenkins, Helm, and Kubernetes.
Proficient in designing data warehouses, utilizing Azure Synapse capabilities for seamless integration and
analysis. Demonstrated expertise in leveraging Synapse pipelines for ETL processes, ensuring smooth data
transformation and loading. Implemented Apache Airflow to author, schedule, and monitor data pipelines.
Designed and implemented Snowflake cloud data platform architectures, ensuring scalability, performance, and
reliability. Developed and maintained data models in Snowflake, ensuring alignment with business
requirements and optimizing schema designs.
Designed and managed big data platform using Hadoop ecosystem components such as HDFS, Oozie, Sqoop,
Apache Spark, and Hive to enable distributed and scalable data processing.
Hands-on experience in handling database issues and connections with SQL and NoSQL databases such as
MongoDB, HBase, Cassandra, SQL Server, and PostgreSQL. Experience in designing and creating RDBMS Tables,
Views, User Created Data Types, Indexes, Stored Procedures, Cursors, Triggers and Transactions. Designed and
built Dimensions and cubes with star schemas using SQL Server Analysis Services (SSAS).
Exhibited proficiency in data modeling, leveraging SQL and PL/SQL queries for insightful results. Worked
seamlessly across Oracle, SQL Server, and MySQL databases, writing stored procedures, functions, joins, and
triggers.
Expert in designing ETL data flows, creating mappings/workflows to extract data from SQL Server and
performing data migration and transformation from Oracle/Access/Excel sheets using SQL Server SSIS. Excelled
in data quality, mapping, and filtration using ETL tools such as Talend, Informatica, and DataStage.
Navigated a range of file formats and Click Stream files, while creating intricate Tableau reports and executing
ad-hoc reporting through Power BI. Led dashboard development and data analysis, revealing customer
purchasing trends. Created real-time Tableau dashboards, executed A/B tests, and collaborated with marketing
for analysis.
Engineered scripts, utilized Spark and Hive, and implemented machine learning models for strategic insights.
Led version control across Linux and Windows platforms using Git, GitLab, GitHub, and Subversion (SVN), with
Unix and shell scripting proficiency driving automation.
Developed robust and scalable back-end components using Python, Django, and Flask technologies. Managed
code versioning, participated in Agile, Scrum, and Kanban methodologies, and employed RESTful API
development, contributing to a positive team spirit and collaborative work environment.
TECHNICAL SKILLS:
AWS (Amazon Web Services): Amazon EMR, S3, RedShift, EC2, IAM, CloudWatch, Glue, Snowflake, DevOps (CI/CD pipelines), RDS, Lambda, Boto3, DynamoDB, Amazon SageMaker
Azure Services: Data Factory, Databricks, Blob Storage, ADLS Gen 2, Delta Lake, Data Lake Analytics, Functions, Synapse Analytics, Stream Analytics, SQL Database, SQL Server, Monitor, Event Hubs, Cosmos DB, Azure Cosmos Graph, Cassandra, MongoDB/Azure Cosmos
Big Data/Hadoop Technologies: MapReduce, Spark, Spark SQL, Spark Streaming, Kafka, PySpark, Pig, Hive, HBase, Flume, Yarn, Oozie, Zookeeper, Hue, Ambari Server
Databases: Oracle, MySQL, SQL Server, MongoDB, Cassandra, DynamoDB, PostgreSQL, Teradata, Cosmos
Programming: Python, Java, PySpark, Scala, Shell Script, Perl Script, SQL, JavaScript, HTML5, Linux
ETL Tools: Informatica, Apache Airflow, Azure Data Factory, Talend, Sqoop, Glue
Reporting and Visualization Tools: Tableau, Power BI, QlikView, Crystal Reports XI, SSRS (SQL Server Reporting Services), Cognos, Excel
PROFESSIONAL EXPERIENCE:
OPTUM (Health Care) | Eden Prairie, Minnesota Sept 2022 – Till Date
Senior Data Engineer
Responsibilities:
Data Governance process development, implementation, and management. Provided full-fledged big-data
processing within the Hadoop framework. Design of new ETL pipeline using Spark and Hive extracting data from
numerous sources.
Designed a data store conforming to the OLTP archive policy, enabling reverse sharing between the ODS and
OLTP. Used ER/Studio for the creation of conceptual, logical, and physical data models and DDL scripts.
Imported real-time data into Hadoop via Kafka with daily Oozie jobs. Designed MapReduce programs for
analysing and cleaning data. Involved in maintaining the Hadoop cluster on AWS EMR.
Used AWS Glue and PySpark to load data into S3 buckets, and filtered data in Elasticsearch to load into Hive
external tables. Skilled in developing CI/CD pipelines using Jenkins, utilizing a tech stack of GitLab,
Jenkins, Helm, and Kubernetes.
Actively engaged in cross-functional discussions to align mapping strategies with REST API design principles
and industry best practices.
Implemented robust error handling mechanisms and ensured fault tolerance in PySpark applications. Managed
PySpark clusters for scalability and performance, integrating with cluster management tools.
Collaborated on cross-service integration projects, integrating AWS Glue with Amazon S3, Redshift, Athena, and
other AWS services. Implemented robust data quality checks and validation mechanisms within AWS Glue jobs.
Developed and implemented incremental data loads using Apache Spark SQL from source systems into Hadoop.
Maintained comprehensive documentation for Spark applications, including code comments and workflow
descriptions.
Applied expertise in HL7 and FHIR to ensure seamless communication and data exchange in healthcare
environments. Demonstrated the ability to troubleshoot and resolve issues related to HL7 and FHIR data
mapping, ensuring system reliability.
Provided business specification logic, stored data in Hive, and performed data analysis. Used Apache Kafka
to gather web log data from different servers and pass it on to downstream systems for analysis.
Established comprehensive monitoring and logging for Amazon EMR clusters using Amazon CloudWatch.
Built a Scala API providing backend support for the graph database user interface, coding it to insert/delete
predicates in the graph DB after transforming and mapping incoming data.
Collected data using Spark Streaming from an AWS S3 bucket in near-real-time, performed the necessary
transformations and aggregations on the fly to build the common learner data model, and persisted the data in
HDFS.
Integrated AWS DynamoDB with AWS Lambda to back up item values, and automated regular AWS tasks such as
snapshot creation using Python scripts.
Optimized Tableau dashboards for performance, addressing factors such as data extracts, filters, and calculated
fields.
Environment: AWS, Spark, SQL, ER/Studio, Hadoop 3.3, AWS EMR, S3, Snowflake, Hive, Pig, Apache Kafka, ETL,
Informatica, Sqoop, Python, PySpark, Shell scripting, Linux, MySQL, Jenkins, Git, HL7 and FHIR, Oozie, Tableau, and
Agile Methodologies.
The Home Depot | Atlanta, GA May 2020 – Aug 2022
Data Engineer
Responsibilities:
Proficient in working across a wide range of Azure services, including HDInsight, Data Lake, Data Bricks, Blob
Storage, Data Factory, Synapse, SQL, SQL DB, DWH, and Data Storage Explorer. Designed and deployed data
pipelines by leveraging Data Lake, Data Bricks, and Apache Airflow. These pipelines facilitated seamless data
integration, transformation, and orchestration.
Create and maintain optimal data pipeline architecture in cloud Microsoft Azure using Data Factory and Azure
Databricks.
Expertise in connecting and integrating various data sources, including cloud-based applications, databases, and
APIs, using StreamSets connectors.
Strong understanding of StreamSets' data governance and security features, including data masking, encryption,
and access controls.
Skilfully integrated data from both on-premises (MySQL, Cassandra) and cloud sources (Blob Storage, Azure
SQL DB) using Azure Data Factory. Applied transformations to load data back to Azure Synapse for enhanced
insights.
Developed Spark Scala functions for real-time data mining, providing crucial real-time insights and generating
reports. Configured Spark Streaming to receive data from Apache Flume and stored it in Azure Table using Scala.
Utilized Azure Data Lake for comprehensive data processing and analytics. Employed Data Bricks for processing
data ingested into Azure Blob Storage, effectively utilizing Spark Scala scripts and UDFs for large-scale
transformations.
Loaded tables from Azure Data Lake to Azure Blob Storage, facilitating smooth data movement to Snowflake for
further analysis. Created complex SnowSQL scripts for reporting and business analysis in the Snowflake Cloud
Data Warehouse.
Utilized Spark Streaming API to ingest data from diverse sources, enhancing existing Scala code to improve
cluster performance and efficiency. Leveraged Spark Data Frames and Data Bricks Notebooks to create datasets
and apply business transformations and data cleansing operations.
Worked with delta tables on Databricks. Hands on experience working with Databricks runtime and
performance tuning of spark jobs.
Proficiently developed ETL pipelines and Directed Acyclic Graph (DAG) workflows using Python scripts, Airflow,
and Apache NiFi.
Led the migration of critical systems from on-premises hosting to Azure Cloud Services, with a focus on
SnowSQL query writing and optimization.
Developed data ingestion pipelines on Azure HDInsight Spark cluster using Azure Data Factory and Spark SQL.
Collaborated with Cosmos DB (SQL API and Mongo API) for seamless data integration.
Designed custom input adapters using Spark, Hive, and Sqoop for data ingestion and analysis from various
sources, including Snowflake, MS SQL, and MongoDB.
Implemented auto-scaling and serverless computing techniques, which reduced overall cloud infrastructure
expenses by 15%.
Possesses knowledge of SAS (Statistical Analysis System), utilizing it for advanced analytics and data processing
when required. Ensured data quality and validation using SAS, implementing checks and procedures to identify
and address discrepancies in datasets.
Developed REST APIs using Python with the Flask and Django frameworks, integrating various data sources
including Java, JDBC, RDBMS, shell scripting, spreadsheets, and text files.
Proficient in utilizing data for interactive Power BI dashboards and reporting purposes based on business
requirements.
Extensively worked on Jenkins to implement continuous integration (CI) and Continuous deployment (CD)
processes. Worked in Agile Methodology and used JIRA to maintain the stories about the project.
Environment: Azure HDInsight, ADLS, Azure Synapse, Azure Data Factory, Data Bricks, Data Lake, Cosmos DB,
MySQL, Snowflake, MongoDB, Palantir, Teradata, Flume, SAS, Blob Storage, Data Factory, ETL, Data Storage Explorer,
Scala, Hadoop (HDFS, MapReduce, Yarn), Spark, Flink, Airflow, Hive, Sqoop, HBase, Kubernetes, Jira, Tableau, Power
BI.
Wells Fargo | Minneapolis, MN Feb 2018 – April 2020
Data Engineer
Responsibilities:
Led the creation of reporting dashboards, conducting data mining and analysis to understand customer purchase
behavior.
Formulated real-time dashboards in Tableau, delivering visual monitoring of crucial metrics and executing A/B
test processing, leveraging both external and internal data.
Collaborated closely with the marketing team to dissect marketing campaign data, executing analysis
encompassing segmentation and cohort analysis.
Orchestrated the design of MySQL table schemas, subsequently implementing stored procedures for efficient
extraction and storage of customer purchase and session data.
Employed the Python packages Pandas and NumPy to query the MySQL database, validating and identifying
inconsistent data. Directed the design of A/B tests, defining metrics, calculating sample size, and assessing
statistical assumptions to validate new user interface features.
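As an illustrative sketch of the sample-size arithmetic behind such A/B test design (the baseline and target rates below are hypothetical, not figures from this engagement):

```python
import math

def ab_sample_size(p_base, p_new):
    """Approximate per-group sample size for a two-sided test comparing two
    proportions at alpha = 0.05 with 80% power (normal approximation)."""
    z_alpha = 1.959964  # critical value for two-sided alpha = 0.05
    z_beta = 0.841621   # critical value for 80% power
    p_bar = (p_base + p_new) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p_base * (1 - p_base)
                                      + p_new * (1 - p_new))) ** 2
    return math.ceil(numerator / (p_new - p_base) ** 2)

# e.g. detecting a lift in click-through rate from 10% to 12%
n_per_group = ab_sample_size(0.10, 0.12)
```

Smaller detectable effects drive the required sample size up quadratically, which is why metric definition comes before sizing.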
Conducted diverse statistical analyses, including hypothesis testing, regression analysis, confidence interval, and
p-value calculations using R. This provided insights to enhance click-through rates and sales. Led Exploratory
Data Analysis, identifying trends using Tableau and Python libraries such as Matplotlib, Seaborn, and Plotly Dash.
Devised scripts for seamless data storage into Hadoop HDFS from varied sources like AWS S3, AWS RDS, Web
API, and NoSQL Database MongoDB. Leveraged Redshift for cloud-based data warehousing, optimizing data
storage, and query performance.
Implemented security measures to protect sensitive data during ETL processes, ensuring compliance with data
governance and privacy regulations.
Collaborated on end-to-end data processing pipelines involving Apache Spark. Implemented effective data
serialization techniques in Spark applications for optimized data transfer.
Designed and managed Data warehouses, utilizing Azure Synapse capabilities for seamless integration and
analysis. Adept at monitoring and optimizing Synapse workloads to enhance data processing efficiency.
Proficient in configuring access controls and permissions in Azure ADLS, ensuring data security and compliance.
Leveraged Spark and Hive within the Big Data ecosystem to analyze extensive datasets up to 2 GB stored in
Hadoop HDFS, executing filtering and aggregation using Spark SQL on Spark DataFrames.
Engineered Python scripts for robust data pre-processing in predictive modeling, encompassing tasks such as
missing value imputation, label encoding, and feature engineering.
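A minimal sketch of the pre-processing steps named above, assuming hypothetical column names (the actual feature set is not shown in this resume):

```python
import pandas as pd

# Hypothetical customer dataset; column names are illustrative only
df = pd.DataFrame({
    "age": [25, None, 40],
    "segment": ["gold", "silver", "gold"],
})

# Missing value imputation: fill numeric gaps with the column median
df["age"] = df["age"].fillna(df["age"].median())

# Label encoding: map each category to a stable integer code
df["segment_code"] = df["segment"].astype("category").cat.codes

# Feature engineering: derive a simple indicator feature
df["is_senior"] = (df["age"] >= 40).astype(int)
```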
Implemented machine learning models, notably Decision Trees and Logistic Regression, to predict revenue from
returning customers. This informed strategic promotion decisions by the marketing team.
Effectively communicated critical data insights to diverse stakeholders, utilizing tools such as MS PowerPoint,
Tableau, and Jupyter Notebook for impactful presentations.
Analysed and interpreted data using Power BI to derive actionable insights, and leveraged Power BI for
visualizing and communicating data science results.
Environment: SQL Server, MySQL, Python, R, Pandas, NumPy, Matplotlib, Seaborn, Plotly, Tableau, Power BI,
Excel, Hadoop, Spark, Hive, Spark SQL, AWS S3, AWS RDS, Redshift, MongoDB, Azure Synapse, ETL, ADLS,
Jupyter Notebook, Machine Learning, Predictive Modelling, Data Visualization, Statistical Analysis.
TechMahindra | Hyderabad, India July 2014 – Oct 2017
Data Analyst
Responsibilities:
Gathered business requirements, definition, and design of the data sourcing, worked with the data warehouse
architect on the development of logical data models.
Ability to query data in a data warehouse and prepare data for reporting and insights automation needs.
Extract, cleanse, and combine data from multiple sources and systems using R and Python Programming.
Perform exploratory and targeted analyses, with a wide variety of statistical methods including cluster,
regression, decision tree/random forest, time series using Python Programming.
Built reports and dashboards to monitor KPIs (Key Performance Indicators) to understand drivers of KPI
changes.
Performed regression testing for Golden Test Cases from the State (end-to-end test cases) and automated the
process using Python scripts.
Designed and executed analytic projects, generated insights to support business decisions using advanced
analytical and visualization techniques such as descriptive, predictive, and prescriptive analytics.
Generated graphs and reports using ggplot package in RStudio for analytical models. Developed and
implemented R and Shiny application which showcases machine learning for business forecasting.
Performed K-means clustering, Regression and Decision Trees in R. Worked on data cleaning and reshaping,
generated segmented subsets using NumPy and Pandas in Python.
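A hedged sketch of the cleaning-and-segmentation pattern described above, with invented example data and spend tiers:

```python
import pandas as pd

# Hypothetical transactions; values and tier boundaries are illustrative only
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "spend": [120.0, 45.0, None, 300.0],
})

# Cleaning: drop rows with missing spend
clean = df.dropna(subset=["spend"])

# Reshaping/segmentation: bucket customers into spend tiers
clean = clean.assign(
    tier=pd.cut(clean["spend"], bins=[0, 100, 200, float("inf")],
                labels=["low", "mid", "high"])
)

# Segmented subsets keyed by tier label
segments = {name: g["customer_id"].tolist()
            for name, g in clean.groupby("tier", observed=True)}
```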
Used Python NumPy, Pandas to perform data cleaning and data transformation activities.
Scheduled data refresh on Tableau Server for weekly and monthly increments based on business change to
ensure that the views and dashboards were displaying the changed data accurately.
Environment: Python, Django, Flask, Beautiful Soup, NumPy, SciPy, matplotlib, Pandas, JavaScript, HTML5, RESTful
API, MySQL, Agile Methodologies, Scrum, Git, Power BI, Software Development Life Cycle (SDLC), Version Control
Systems.