Understanding Machine Learning Fundamentals
Machine Learning (ML) is a subset of artificial intelligence in which systems learn patterns from data and improve with experience rather than following explicitly programmed rules. This guide covers the fundamental concepts, algorithms, and practical aspects of machine learning.
Core Concepts
Types of Learning
Supervised Learning
- Classification
- Regression
- Training data with labels
- Model evaluation metrics
Unsupervised Learning
- Clustering
- Dimensionality reduction
- Pattern discovery
- Feature learning
Reinforcement Learning
- Agent-environment interaction
- Reward-based learning
- Policy optimization
- State-action mapping
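To make the agent-environment loop concrete, here is a minimal tabular Q-learning sketch. The toy environment, the state and action counts, and the learning-rate, discount, and exploration values are all assumptions chosen for illustration, not part of any standard benchmark.
import numpy as np
# Hypothetical toy setup: 5 states, 2 actions, illustrative hyperparameters
n_states, n_actions = 5, 2
alpha, gamma, epsilon = 0.1, 0.99, 0.1  # learning rate, discount, exploration
Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(42)

def step(state, action):
    # Stand-in environment dynamics, invented for this example
    next_state = (state + action) % n_states
    reward = 1.0 if next_state == n_states - 1 else 0.0
    return next_state, reward

state = 0
for _ in range(1000):
    # Epsilon-greedy: explore occasionally, otherwise act greedily
    if rng.random() < epsilon:
        action = int(rng.integers(n_actions))
    else:
        action = int(np.argmax(Q[state]))
    next_state, reward = step(state, action)
    # Core update: move Q(s, a) toward reward + gamma * max over a' of Q(s', a')
    Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
    state = next_state
The learned table Q is exactly the "state-action mapping" listed above: the greedy policy simply picks np.argmax(Q[state]) in each state.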
Implementation Basics
Data Preparation
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# Load and prepare data
data = pd.read_csv('dataset.csv')
X = data.drop('target', axis=1)
y = data['target']
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
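Note that the scaler is fit on the training split only and merely applied to the test split; fitting it on the full dataset would leak information about the test data into training.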
Basic Models
- Linear Regression
from sklearn.linear_model import LinearRegression
# Create and train model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions
predictions = model.predict(X_test)
# Evaluate performance
from sklearn.metrics import mean_squared_error, r2_score
mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)
- Classification
from sklearn.ensemble import RandomForestClassifier
# Create and train classifier
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
# Make predictions
predictions = clf.predict(X_test)
# Evaluate performance
from sklearn.metrics import accuracy_score, classification_report
accuracy = accuracy_score(y_test, predictions)
report = classification_report(y_test, predictions)
Common Algorithms
Supervised Learning
Linear Models
- Linear Regression
- Logistic Regression
- Support Vector Machines
- Perceptron
Tree-Based Methods
- Decision Trees
- Random Forests
- Gradient Boosting
- XGBoost
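As a quick illustration of a tree-based method in practice, here is a short scikit-learn sketch; the synthetic dataset and the hyperparameter values are assumptions made for the example.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
# Synthetic data, used only to make the example self-contained
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
# Gradient boosting fits shallow trees sequentially, each correcting the errors
# of the ensemble built so far
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
gbm.fit(X_tr, y_tr)
print(gbm.score(X_te, y_te))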
Unsupervised Learning
- Clustering
from sklearn.cluster import KMeans
# Create and fit clustering model
kmeans = KMeans(n_clusters=3, random_state=42)
clusters = kmeans.fit_predict(X)
# Analyze clusters
from sklearn.metrics import silhouette_score
silhouette_avg = silhouette_score(X, clusters)
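Silhouette scores range from -1 to 1, with higher values indicating compact, well-separated clusters, so comparing the average score across several values of n_clusters is a common way to choose k.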
- Dimensionality Reduction
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# Standardize features, then project onto two principal components
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)
# Analyze explained variance
explained_variance_ratio = pca.explained_variance_ratio_
Model Evaluation
Metrics and Validation
Regression Metrics
- Mean Squared Error (MSE)
- Root Mean Squared Error (RMSE)
- R-squared (R²)
- Mean Absolute Error (MAE)
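A short sketch computing all four metrics, reusing y_test and predictions from the regression example above (RMSE is simply the square root of MSE):
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
mse = mean_squared_error(y_test, predictions)
rmse = np.sqrt(mse)  # RMSE: same units as the target
mae = mean_absolute_error(y_test, predictions)
r2 = r2_score(y_test, predictions)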
Classification Metrics
- Accuracy
- Precision
- Recall
- F1-score
- ROC-AUC
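These can be computed directly with scikit-learn, reusing clf, y_test, and predictions from the classification example; this sketch assumes a binary target (multiclass problems need an averaging strategy such as average='macro').
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score
precision = precision_score(y_test, predictions)
recall = recall_score(y_test, predictions)
f1 = f1_score(y_test, predictions)
# ROC-AUC needs probabilities or scores rather than hard class labels
probas = clf.predict_proba(X_test)[:, 1]
roc_auc = roc_auc_score(y_test, probas)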
Cross-Validation
from sklearn.model_selection import cross_val_score
# Perform k-fold cross-validation
scores = cross_val_score(model, X, y, cv=5)
# Calculate mean and standard deviation
print(f"Mean CV score: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
Feature Engineering
Feature Selection
- Statistical Methods
from sklearn.feature_selection import SelectKBest, f_classif
# Select top k features
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)
# Get selected feature indices
selected_features = selector.get_support()
- Model-Based Selection
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
# Use Random Forest for feature selection
selector = SelectFromModel(RandomForestClassifier(n_estimators=100))
X_selected = selector.fit_transform(X, y)
Feature Creation
- Polynomial Features
from sklearn.preprocessing import PolynomialFeatures
# Create polynomial features
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
- Custom Transformers
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np
class CustomFeatureTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Example transformation: compress feature scales with log1p
        return np.log1p(X)
Hyperparameter Tuning
Grid Search
from sklearn.model_selection import GridSearchCV
# Define parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10]
}
# Perform grid search
grid_search = GridSearchCV(
    RandomForestClassifier(),
    param_grid,
    cv=5,
    scoring='accuracy'
)
grid_search.fit(X_train, y_train)
# Get best parameters
best_params = grid_search.best_params_
Random Search
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
# Define parameter distribution
param_dist = {
    'n_estimators': randint(100, 500),
    'max_depth': randint(10, 50),
    'min_samples_split': randint(2, 20)
}
# Perform random search
random_search = RandomizedSearchCV(
    RandomForestClassifier(),
    param_dist,
    n_iter=100,
    cv=5
)
random_search.fit(X_train, y_train)
Best Practices
Data Preprocessing
- Handling Missing Values
# Fill missing numeric values with each column's mean
data = data.fillna(data.mean(numeric_only=True))
# Or use more sophisticated imputation
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)
- Handling Categorical Variables
# One-hot encoding
X_encoded = pd.get_dummies(X, columns=['categorical_column'])
# Or use sklearn's encoder (sparse_output replaced the deprecated sparse argument)
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
X_encoded = encoder.fit_transform(X[['categorical_column']])
Model Selection
Bias-Variance Tradeoff
- Model complexity
- Overfitting vs underfitting
- Validation curves
- Learning curves
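Learning curves are a practical way to diagnose which side of the tradeoff a model sits on. A minimal sketch using scikit-learn's learning_curve helper with the X and y from earlier:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve
# Score the model at increasing training-set sizes
train_sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(), X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 5)
)
# A persistent gap between the two mean curves suggests overfitting;
# two low, converged curves suggest underfitting
print(train_scores.mean(axis=1), val_scores.mean(axis=1))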
Ensemble Methods
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
# Create ensemble of different models
estimators = [
    ('rf', RandomForestClassifier()),
    ('svm', SVC(probability=True)),
    ('lr', LogisticRegression())
]
ensemble = VotingClassifier(estimators=estimators, voting='soft')
Advanced Topics
Pipeline Construction
from sklearn.pipeline import Pipeline
# Create preprocessing and modeling pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('classifier', RandomForestClassifier())
])
# Fit and predict with pipeline
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
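Because every step lives inside the pipeline, tools such as cross_val_score and GridSearchCV refit the scaler and PCA on each training fold, which keeps preprocessing from leaking information across folds.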
Custom Metrics
from sklearn.metrics import make_scorer
import numpy as np

def custom_metric(y_true, y_pred):
    # Example metric: fraction of predictions within 10% of the true value
    return np.mean(np.abs(y_true - y_pred) <= 0.1 * np.abs(y_true))
# Create scorer
custom_scorer = make_scorer(custom_metric)
# Use in cross-validation
scores = cross_val_score(model, X, y, scoring=custom_scorer)
Resources
Learning Materials
Books
- "Introduction to Machine Learning with Python"
- "The Hundred-Page Machine Learning Book"
- "Pattern Recognition and Machine Learning"
- "Elements of Statistical Learning"
Online Resources
- Scikit-learn documentation
- Online courses
- Tutorial series
- Research papers
Remember that machine learning is both an art and a science. Practice with real datasets, experiment with different approaches, and always validate your models thoroughly.