- Code in Scala and PySpark daily, on cloud as well as on-prem infrastructure
- Build data models to store data in the most optimized manner
- Identify, design, and implement process improvements: automating manual processes, optimizing data delivery, re-designing infrastructure for greater scalability, etc.
- Implement the ETL process and an optimal data pipeline architecture
- Monitor performance and advise on any necessary infrastructure changes
- Create data tools for analytics and data science team members that assist them in building and optimizing our product into an innovative industry leader
- Work with data and analytics experts to strive for greater functionality in our data systems
- Proactively identify potential production issues, then recommend and implement solutions
- Write quality code and build secure, highly available systems
- Create design documents that describe the functionality, capacity, architecture, and process
- Review peer code and pipelines before deploying to production, checking for optimization issues and adherence to code standards
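The ETL and pipeline responsibilities above follow a standard extract-transform-load shape. As a toy illustration only (the table and record names here are hypothetical, and a real pipeline for this role would run on PySpark/Scala rather than the Python standard library), the three stages look like this:

```python
import sqlite3

def extract(rows):
    """Extract: pull raw records from a source (here, an in-memory list)."""
    return list(rows)

def transform(records):
    """Transform: clean and validate records before loading."""
    return [
        (name.strip().lower(), amount)
        for name, amount in records
        if amount is not None and amount >= 0  # drop invalid rows
    ]

def load(records, conn):
    """Load: write the transformed records into the target store."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (name TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", records)
    conn.commit()

raw = [(" Alice ", 10.0), ("BOB", 5.5), ("eve", None), ("mallory", -1.0)]
conn = sqlite3.connect(":memory:")
load(transform(extract(raw)), conn)
total = conn.execute("SELECT COUNT(*), SUM(amount) FROM sales").fetchone()
print(total)  # (2, 15.5) -- two invalid rows were filtered out
```

The same separation of stages is what makes pipelines testable and optimizable independently, which is the point of the code-review and design-document responsibilities listed above.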
What we are looking for
- Good understanding of optimal extraction, transformation, and loading of data from a wide variety of data sources using SQL and big data technologies
- Proficient understanding of distributed computing principles
- Experience working with batch-processing and real-time systems using open-source technologies such as NoSQL stores, Spark, Pig, Hive, and Apache Airflow
- Experience implementing complex projects dealing with considerable data sizes (petabyte scale)
- Experience integrating data from multiple data sources
- Experience with NoSQL databases such as HBase, Cassandra, and MongoDB
- Knowledge of various ETL techniques and frameworks, such as Flume
- Experience with messaging systems such as Kafka or RabbitMQ
- Experience creating DAGs for data engineering
- Expert-level Python/Scala programming, especially for data engineering/ETL purposes
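A DAG for data engineering is simply a set of tasks with upstream dependencies; an orchestrator such as Airflow computes a valid execution order from that graph. A minimal standard-library sketch of the same idea (the task names here are hypothetical, not an Airflow API):

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on,
# mirroring how an orchestrator like Airflow wires upstream dependencies.
pipeline = {
    "extract_orders": set(),
    "extract_users": set(),
    "transform_join": {"extract_orders", "extract_users"},
    "load_warehouse": {"transform_join"},
}

# A topological sort yields an order in which every task runs only
# after all of its upstream dependencies have completed.
order = list(TopologicalSorter(pipeline).static_order())
print(order)  # both extracts first, then transform_join, then load_warehouse
```

`TopologicalSorter` also raises `CycleError` on a cyclic graph, which is the same validation an orchestrator performs before scheduling a DAG.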