# First Experiment

Run your first machine learning experiment with Fetch ML.

## Prerequisites

**Container Runtimes:**
- **Docker Compose**: For testing and development only
- **Podman**: For production experiment execution

- Fetch ML installed and running
- API key (see [Security](security.md) and [API Key Process](api-key-process.md))
- Basic ML knowledge

## Experiment Workflow

### 1. Prepare Your ML Code

Create a simple Python script:

```python
# experiment.py
import argparse
import json


def main():
    # Command-line hyperparameters; the defaults apply when a flag is omitted
    parser = argparse.ArgumentParser()
    parser.add_argument('--epochs', type=int, default=10)
    parser.add_argument('--lr', type=float, default=0.001)
    parser.add_argument('--output', default='results.json')

    args = parser.parse_args()

    # Simulate training: the metrics are simple functions of the arguments
    results = {
        'epochs': args.epochs,
        'learning_rate': args.lr,
        'accuracy': 0.85 + (args.lr * 0.1),
        'loss': 0.5 - (args.epochs * 0.01),
        'training_time': args.epochs * 0.1
    }

    # Save results
    with open(args.output, 'w') as f:
        json.dump(results, f, indent=2)

    print(f"Training completed: {results}")
    return results


if __name__ == '__main__':
    main()
```

### 2. Submit Job via API

```bash
# Submit experiment
curl -X POST http://localhost:8080/api/v1/jobs \
  -H "Content-Type: application/json" \
  -H "X-API-Key: your-api-key" \
  -d '{
    "job_name": "first-experiment",
    "args": "--epochs 20 --lr 0.01 --output experiment_results.json",
    "priority": 1,
    "metadata": {
      "experiment_type": "training",
      "dataset": "sample_data"
    }
  }'
```

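The same submission can be scripted from Python. The sketch below only builds the request, using the endpoint, headers, and fields from the curl call above; `build_submit_request` is a hypothetical helper name, not part of Fetch ML, and the commented-out lines show how the request could be sent to a running server.

```python
import json
import urllib.request

# Endpoint taken from the curl example; adjust host and port for your setup.
API_URL = "http://localhost:8080/api/v1/jobs"


def build_submit_request(api_key, job_name, args, priority=1, metadata=None):
    """Build (but do not send) the POST request that submits a job."""
    payload = {
        "job_name": job_name,
        "args": args,
        "priority": priority,
        "metadata": metadata or {},
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
        method="POST",
    )


request = build_submit_request(
    "your-api-key",
    "first-experiment",
    "--epochs 20 --lr 0.01 --output experiment_results.json",
    metadata={"experiment_type": "training", "dataset": "sample_data"},
)

# To send it against a running server:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Building the body with `json.dumps` avoids hand-written quoting, which becomes error-prone once arguments contain quotes or spaces.
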
### 3. Monitor Progress

```bash
# Check job status
curl -H "X-API-Key: your-api-key" \
  http://localhost:8080/api/v1/jobs/first-experiment

# List all jobs
curl -H "X-API-Key: your-api-key" \
  http://localhost:8080/api/v1/jobs

# Get job metrics
curl -H "X-API-Key: your-api-key" \
  http://localhost:8080/api/v1/jobs/first-experiment/metrics
```

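Rather than re-running the status call by hand, a small polling loop can wait for the job to finish. This is a minimal sketch: `fetch_job` and `wait_for_job` are hypothetical helpers, and it assumes the status endpoint returns JSON with a `status` field whose terminal values are `completed` and `failed`.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8080/api/v1/jobs"  # from the curl examples


def fetch_job(job_name, api_key):
    """GET the job document; mirrors the 'Check job status' curl call."""
    request = urllib.request.Request(
        f"{BASE_URL}/{job_name}", headers={"X-API-Key": api_key}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_job(job_name, api_key, fetch=fetch_job, interval=5.0, max_polls=60):
    """Poll until the job reports a terminal status or max_polls is reached."""
    for _ in range(max_polls):
        job = fetch(job_name, api_key)
        # Assumption: "completed" and "failed" are the only terminal statuses.
        if job.get("status") in ("completed", "failed"):
            return job
        time.sleep(interval)
    raise TimeoutError(f"{job_name} did not finish after {max_polls} polls")
```

Calling `wait_for_job("first-experiment", "your-api-key")` blocks until the job finishes or the poll limit is hit. The `fetch` parameter exists so the loop can be exercised without a server.
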
### 4. Use CLI

```bash
# Build the CLI (the subshell keeps you in the repository root afterwards)
(cd cli && zig build --release=fast)

# Submit with CLI
./cli/zig-out/bin/ml submit \
  --name "cli-experiment" \
  --args "--epochs 15 --lr 0.005" \
  --server http://localhost:8080

# Monitor with CLI
./cli/zig-out/bin/ml list-jobs --server http://localhost:8080
./cli/zig-out/bin/ml job-status cli-experiment --server http://localhost:8080
```

## Advanced Experiment

### Hyperparameter Tuning

```bash
# Submit multiple experiments
for lr in 0.001 0.01 0.1; do
  curl -X POST http://localhost:8080/api/v1/jobs \
    -H "Content-Type: application/json" \
    -H "X-API-Key: your-api-key" \
    -d "{
      \"job_name\": \"tune-lr-$lr\",
      \"args\": \"--epochs 10 --lr $lr\",
      \"metadata\": {\"learning_rate\": $lr}
    }"
done
```

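For larger sweeps, generating the payloads in Python sidesteps the escaped quotes in the shell loop. The sketch below produces the same three request bodies; `sweep_payloads` is a hypothetical helper, and sending each payload is left to whatever HTTP client you already use.

```python
import json


def sweep_payloads(learning_rates, epochs=10):
    """Return one job payload per learning rate, mirroring the shell loop."""
    payloads = []
    for lr in learning_rates:
        payloads.append({
            "job_name": f"tune-lr-{lr}",
            "args": f"--epochs {epochs} --lr {lr}",
            "metadata": {"learning_rate": lr},
        })
    return payloads


# Print the JSON body that would be POSTed for each run.
for payload in sweep_payloads([0.001, 0.01, 0.1]):
    print(json.dumps(payload))
```
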
### Batch Processing

```bash
# Submit batch job
curl -X POST http://localhost:8080/api/v1/jobs \
  -H "Content-Type: application/json" \
  -H "X-API-Key: your-api-key" \
  -d '{
    "job_name": "batch-processing",
    "args": "--input data/ --output results/ --batch-size 32",
    "priority": 2,
    "datasets": ["training_data", "validation_data"]
  }'
```

## Results and Output

### Access Results

```bash
# Download results
curl -H "X-API-Key: your-api-key" \
  http://localhost:8080/api/v1/jobs/first-experiment/results

# View job details
curl -H "X-API-Key: your-api-key" \
  http://localhost:8080/api/v1/jobs/first-experiment | jq .
```

### Result Format

```json
{
  "job_id": "first-experiment",
  "status": "completed",
  "results": {
    "epochs": 20,
    "learning_rate": 0.01,
    "accuracy": 0.851,
    "loss": 0.3,
    "training_time": 2.0
  },
  "metrics": {
    "gpu_utilization": "85%",
    "memory_usage": "2GB",
    "execution_time": "120s"
  }
}
```

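The `metrics` values in this format are strings with units attached, so they need a little parsing before they can be compared across runs. The following is a sketch under the assumption that responses follow the shape shown above; `parse_metric` and `summarize` are hypothetical helpers.

```python
import json
import re


def parse_metric(value):
    """Split a string such as "85%" or "2GB" into (number, unit)."""
    match = re.fullmatch(r"([0-9]+(?:\.[0-9]+)?)\s*([A-Za-z%]*)", value.strip())
    if match is None:
        raise ValueError(f"unrecognised metric value: {value!r}")
    return float(match.group(1)), match.group(2)


def summarize(response_text):
    """Extract status, accuracy, and parsed metrics from a job response body."""
    job = json.loads(response_text)
    metrics = {
        name: parse_metric(raw) for name, raw in job.get("metrics", {}).items()
    }
    return {
        "status": job["status"],
        "accuracy": job["results"]["accuracy"],
        "metrics": metrics,
    }
```

Passing the body returned by the job details call to `summarize` yields a dictionary whose `metrics` entries are `(number, unit)` tuples.
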
## Best Practices

### Job Naming

- Use descriptive names: `model-training-v2`, `data-preprocessing`
- Include version numbers: `experiment-v1`, `experiment-v2`
- Add timestamps: `daily-batch-2024-01-15`

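These conventions are easy to apply consistently with a small helper. The sketch below is illustrative only; `make_job_name` is a hypothetical function, not part of Fetch ML.

```python
from datetime import date


def make_job_name(base, version=None, run_date=None):
    """Join a descriptive base name with an optional version and ISO date."""
    parts = [base]
    if version is not None:
        parts.append(f"v{version}")
    if run_date is not None:
        parts.append(run_date.isoformat())
    return "-".join(parts)


print(make_job_name("experiment", version=2))                    # experiment-v2
print(make_job_name("daily-batch", run_date=date(2024, 1, 15)))  # daily-batch-2024-01-15
```
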
### Metadata Usage

```json
{
  "metadata": {
    "experiment_type": "training",
    "model_version": "v2.1",
    "dataset": "imagenet-2024",
    "environment": "gpu",
    "team": "ml-team"
  }
}
```

### Error Handling

```bash
# Check failed jobs
curl -H "X-API-Key: your-api-key" \
  "http://localhost:8080/api/v1/jobs?status=failed"

# Retry failed job
curl -X POST http://localhost:8080/api/v1/jobs \
  -H "Content-Type: application/json" \
  -H "X-API-Key: your-api-key" \
  -d '{
    "job_name": "retry-experiment",
    "args": "--epochs 20 --lr 0.01",
    "metadata": {"retry_of": "first-experiment"}
  }'
```

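If you keep the payload you originally submitted, the retry body can be derived from it instead of being retyped. This is a minimal sketch: `build_retry_payload` is a hypothetical helper, and it assumes the original payload is available locally as a dictionary.

```python
def build_retry_payload(original, retry_name):
    """Copy a submitted payload under a new name and record what it retries."""
    payload = {key: value for key, value in original.items() if key != "metadata"}
    payload["job_name"] = retry_name
    # Keep the original metadata and add the back-reference used above.
    payload["metadata"] = {
        **original.get("metadata", {}),
        "retry_of": original["job_name"],
    }
    return payload


original = {"job_name": "first-experiment", "args": "--epochs 20 --lr 0.01"}
retry = build_retry_payload(original, "retry-experiment")
```

The resulting `retry` dictionary matches the JSON body in the retry call above and leaves `original` unchanged.
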
## Related Documentation

- [Quick Start](quick-start.md) - Local development environment and dev stack
- [Testing Guide](testing.md) - Test your experiments
- [Deployment](deployment.md) - Scale to production
- [Performance Monitoring](performance-monitoring.md) - Track experiment performance

## Troubleshooting

**Job stuck in pending?**
- Check worker status: `curl http://localhost:8080/api/v1/workers`
- Verify resources: `docker stats`
- Check logs: `docker logs ml-experiments-api`

**Job failed?**
- Check the error message: `curl -H "X-API-Key: your-api-key" http://localhost:8080/api/v1/jobs/job-id` (replace `job-id` with your job's name)
- Review job arguments
- Verify input data

**No results?**
- Check job completion status
- Verify output file paths
- Check storage permissions